1 Import Libraries

library(dplyr)
library(ggplot2)
library(broom)
library(janitor)
library(renv)
library(purrr)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(biclust)
library(cluster)
library(igraph)
library(fpc)
library(magrittr)
library(rmarkdown)
library(textreuse)
library(slam)



library(plotly)
library(htmltools)
library(klaR)
library(tidyr)
library(stringr)

2 Introduction

Measuring text similarity is one of the most important tasks in Natural Language Processing (NLP). Text similarity is the process of comparing one piece of text with another and quantifying how close the two are.

For this purpose, we chose a dataset of speeches by American presidents. We first converted the data to a DataFrame. We then preprocessed it: first we made the documents uniform by removing punctuation and numbers and converting the text to lowercase, and then we applied basic NLP text transformations such as stop-word removal, tokenization, and lemmatization.
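
As a rough illustration of the lowercasing, tokenizing, and stop-word steps described above, here is a minimal base-R sketch. The function name `tokenize_clean` and the tiny stop-word list are hypothetical and chosen only for this example; the report itself relies on tm, whose stop-word lists are much longer.

```r
# Hypothetical mini stop-word list, for illustration only;
# real lists such as tm::stopwords("english") are much longer.
stop_words <- c("the", "of", "and", "to", "a", "in")

# Lowercase the text, drop everything except letters and spaces,
# split it into tokens, and remove the stop words.
tokenize_clean <- function(text, stop_words) {
  text <- tolower(text)                        # uniform case
  text <- gsub("[^a-z ]+", " ", text)          # punctuation and digits become spaces
  tokens <- strsplit(trimws(text), " +")[[1]]  # split on runs of spaces
  tokens[!tokens %in% stop_words]              # keep only non-stop-word tokens
}

tokenize_clean("In 1789 the Senate and the House met in New York.", stop_words)
# "senate" "house" "met" "new" "york"
```

Lemmatization is left out of this sketch because it needs a dictionary or a dedicated package.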

In the next step we used tools from the “tm” library to find term similarity. Later we split the speeches into separate documents and used the “textreuse” library to measure document similarity. We visualized the results at each stage.
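
To make "document similarity" concrete, the sketch below computes the Jaccard similarity of two token sets in base R. Jaccard is only one common measure and is used here as an assumption for illustration: this section does not state which measure the report applies. The textreuse package also provides a `jaccard_similarity()` function. The function name `jaccard` and the two toy token vectors are hypothetical.

```r
# Jaccard similarity of two token sets: |A intersect B| / |A union B|
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Toy token vectors, invented for this example
doc_a <- c("we", "the", "people", "of", "the", "united", "states")
doc_b <- c("the", "people", "of", "this", "nation")

jaccard(doc_a, doc_b)
# 0.375  (3 shared tokens out of 8 distinct tokens)
```

A value of 1 means the two token sets are identical, and 0 means they share no tokens.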

2.1 Describe Dataset

The dataset used for this project consists of presidential speeches obtained from this link.

Using the following script in Python, we first created a dataframe of the website’s speeches:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scrapes transcripts for inaugural addresses


def get_urls(url):
    '''Returns list of transcript urls'''
    
    page = requests.get(url).text
    soup=BeautifulSoup(page, 'lxml')
    url_table = soup.find("table", class_='table').find_all("a")
    return [u["href"] for u in url_table]

urls = get_urls("https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/inaugural-addresses")

def get_transcripts(urls):
    '''Returns a DataFrame with one row (president, year, content) per transcript'''
    records = []
    for u in urls:
        page = requests.get(u).text
        soup = BeautifulSoup(page, 'lxml')
        t_president = soup.find("h3", class_="diet-title").text
        t_year = soup.find("span", class_="date-display-single").text.split(',')[1].strip()
        t_content = soup.find("div", class_="field-docs-content").text
        records.append({
            'president' : t_president,
            'year' : t_year,
            'content' : t_content
        })
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of records
    return pd.DataFrame(records)

data = get_transcripts(urls)
data.to_csv("us_presidents_transcripts.csv", sep="|")

In what follows, we load the dataframe:

df <- read.csv("https://raw.githubusercontent.com/berserkhmdvhb/MADS-NLP/main/data/presidents-speech.csv")
df |> dplyr::glimpse()
## Rows: 59
## Columns: 4
## $ X         <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ president <chr> "George Washington", "George Washington", "John Adams", "Tho…
## $ year      <int> 1789, 1793, 1797, 1801, 1805, 1809, 1813, 1817, 1821, 1825, …
## $ content   <chr> "\nFellow-Citizens of the Senate and of the House of Represe…

Our dataframe has 4 columns: X, which is the row index; president, which holds the name of the president; year, which is the year in which the speech was given; and content, which holds the full text of each speech.

Below we check some details of the dataframe. It has 59 records. The earliest speech dates from 1789 and the latest from 2021.

df |> summary()
##        X         president              year        content         
##  Min.   : 0.0   Length:59          Min.   :1789   Length:59         
##  1st Qu.:14.5   Class :character   1st Qu.:1847   Class :character  
##  Median :29.0   Mode  :character   Median :1905   Mode  :character  
##  Mean   :29.0                      Mean   :1905                     
##  3rd Qu.:43.5                      3rd Qu.:1963                     
##  Max.   :58.0                      Max.   :2021

In what follows, a text file is generated from each row of the dataframe and stored in the “texts” folder:

# Write each speech to its own text file, named "<year>-<president>.txt"
loc <- "./data/texts"
dir.create(loc, recursive = TRUE, showWarnings = FALSE)  # make sure the folder exists

for (i in seq_len(nrow(df))) {       # for-loop over rows
  df_i <- df[i, ]
  text <- df_i$content |> stringr::str_trim()
  file_name <- paste0(df_i$year, "-", df_i$president, ".txt")
  writeLines(text, file.path(loc, file_name))
}
# Load all text files in the folder into a tm corpus
docs <- tm::VCorpus(tm::DirSource(loc))
summary(docs)
##                                 Length Class             Mode
## 1789-George Washington.txt      2      PlainTextDocument list
## 1793-George Washington.txt      2      PlainTextDocument list
## 1797-John Adams.txt             2      PlainTextDocument list
## 1801-Thomas Jefferson.txt       2      PlainTextDocument list
## 1805-Thomas Jefferson.txt       2      PlainTextDocument list
## 1809-James Madison.txt          2      PlainTextDocument list
## 1813-James Madison.txt          2      PlainTextDocument list
## 1817-James Monroe.txt           2      PlainTextDocument list
## 1821-James Monroe.txt           2      PlainTextDocument list
## 1825-John Quincy Adams.txt      2      PlainTextDocument list
## 1829-Andrew Jackson.txt         2      PlainTextDocument list
## 1833-Andrew Jackson.txt         2      PlainTextDocument list
## 1837-Martin van Buren.txt       2      PlainTextDocument list
## 1841-William Henry Harrison.txt 2      PlainTextDocument list
## 1845-James K. Polk.txt          2      PlainTextDocument list
## 1849-Zachary Taylor.txt         2      PlainTextDocument list
## 1853-Franklin Pierce.txt        2      PlainTextDocument list
## 1857-James Buchanan.txt         2      PlainTextDocument list
## 1861-Abraham Lincoln.txt        2      PlainTextDocument list
## 1865-Abraham Lincoln.txt        2      PlainTextDocument list
## 1869-Ulysses S. Grant.txt       2      PlainTextDocument list
## 1873-Ulysses S. Grant.txt       2      PlainTextDocument list
## 1877-Rutherford B. Hayes.txt    2      PlainTextDocument list
## 1881-James A. Garfield.txt      2      PlainTextDocument list
## 1885-Grover Cleveland.txt       2      PlainTextDocument list
## 1889-Benjamin Harrison.txt      2      PlainTextDocument list
## 1893-Grover Cleveland.txt       2      PlainTextDocument list
## 1897-William McKinley.txt       2      PlainTextDocument list
## 1901-William McKinley.txt       2      PlainTextDocument list
## 1905-Theodore Roosevelt.txt     2      PlainTextDocument list
## 1909-William Howard Taft.txt    2      PlainTextDocument list
## 1913-Woodrow Wilson.txt         2      PlainTextDocument list
## 1917-Woodrow Wilson.txt         2      PlainTextDocument list
## 1921-Warren G. Harding.txt      2      PlainTextDocument list
## 1925-Calvin Coolidge.txt        2      PlainTextDocument list
## 1929-Herbert Hoover.txt         2      PlainTextDocument list
## 1933-Franklin D. Roosevelt.txt  2      PlainTextDocument list
## 1937-Franklin D. Roosevelt.txt  2      PlainTextDocument list
## 1941-Franklin D. Roosevelt.txt  2      PlainTextDocument list
## 1945-Franklin D. Roosevelt.txt  2      PlainTextDocument list
## 1949-Harry S. Truman.txt        2      PlainTextDocument list
## 1953-Dwight D. Eisenhower.txt   2      PlainTextDocument list
## 1957-Dwight D. Eisenhower.txt   2      PlainTextDocument list
## 1961-John F. Kennedy.txt        2      PlainTextDocument list
## 1965-Lyndon B. Johnson.txt      2      PlainTextDocument list
## 1969-Richard Nixon.txt          2      PlainTextDocument list
## 1973-Richard Nixon.txt          2      PlainTextDocument list
## 1977-Jimmy Carter.txt           2      PlainTextDocument list
## 1981-Ronald Reagan.txt          2      PlainTextDocument list
## 1985-Ronald Reagan.txt          2      PlainTextDocument list
## 1989-George Bush.txt            2      PlainTextDocument list
## 1993-William J. Clinton.txt     2      PlainTextDocument list
## 1997-William J. Clinton.txt     2      PlainTextDocument list
## 2001-George W. Bush.txt         2      PlainTextDocument list
## 2005-George W. Bush.txt         2      PlainTextDocument list
## 2009-Barack Obama.txt           2      PlainTextDocument list
## 2013-Barack Obama.txt           2      PlainTextDocument list
## 2017-Donald J. Trump.txt        2      PlainTextDocument list
## 2021-Joseph R. Biden.txt        2      PlainTextDocument list
inspect(docs[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 8617

Here we check the content of the first document, which should be George Washington's speech from 1789. We will use this document to demonstrate the preprocessing steps.

writeLines(as.character(docs[1]))
## list(list(content = c("Fellow-Citizens of the Senate and of the House of Representatives:", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead [see APP note] me, and its consequences be judged by my country with some share of the partiality in which they originated.", 
## "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow-citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.", 
## "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.", 
## "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.", 
## "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.", 
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend."
## ), meta = list(author = character(0), datetimestamp = list(sec = 18.2820847034454, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = "1789-George Washington.txt", language = "en", origin = character(0))))
## list()
## list()

2.2 Goal and Procedure

This project investigates text similarity between speeches given by different US presidents in various years, from 1789 to 2021.

In the Preprocessing section, several text mining tasks are applied to all the documents.

In the Term Similarity section, the frequency of different terms in the documents is analyzed and visualized.

In the Doc Similarity section, similarity between documents is measured, analyzed, and visualized.

In the Conclusion, the main findings are summarized.

The GitHub repository for this project can be found at this link.

3 Preprocessing

tm is a framework for text mining applications within R. Most of the functions used from here on come from this package.

3.1 Remove punctuation

Removing punctuation helps to treat tokens uniformly. For example, the words data and data! are treated as the same word once punctuation is removed. After the removal we print the content of the first document again and check the result. In the printed output the paragraphs are still separated by commas and enclosed in quotes, but the punctuation inside the quotes has been removed.

docs <- tm::tm_map(docs,removePunctuation)   
writeLines(as.character(docs[1])) 
## list(list(content = c("FellowCitizens of the Senate and of the House of Representatives", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the 14th day of the present month On the one hand I was summoned by my country whose voice I can never hear but with veneration and love from a retreat which I had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time On the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected All I dare hope is that if in executing this task I have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see APP note me and its consequences be judged by my country with some share of the partiality in which they originated", 
## "Such being the impressions under which I have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge In tendering this homage to the Great Author of every public and private good I assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage These reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed You will join with me I trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence", 
## "By the article establishing the executive department it is made the duty of the President to recommend to your consideration such measures as he shall judge necessary and expedient The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given It will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world I dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the American people", 
## "Besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them Instead of undertaking particular recommendations on this subject in which I could be guided by no lights derived from official opportunities I shall again give way to my entire confidence in your discernment and pursuit of the public good for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted", 
## "To the foregoing observations I have one to add which will be most properly addressed to the House of Representatives It concerns myself and will therefore be as brief as possible When I was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which I contemplated my duty required that I should renounce every pecuniary compensation From this resolution I have in no instance departed and being still under the impressions which produced it I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require", 
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together I shall take my present leave but not without resorting once more to the benign Parent of the Human Race in humble supplication that since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so His divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this Government must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 18.2820847034454, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = "1789-George Washington.txt", language = "en", origin = character(0))))
## list()
## list()
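
The output above shows that "Fellow-Citizens" became "FellowCitizens": deleting a hyphen merges the two words it joined. The base-R sketch below contrasts deleting punctuation with replacing it by a space. It is an illustration on a single string, not the internals of `removePunctuation`, and the variable names are hypothetical.

```r
x <- "Fellow-Citizens of the Senate, and of the House:"

# Option 1: delete punctuation outright (hyphenated words are merged)
deleted <- gsub("[[:punct:]]+", "", x)
deleted
# "FellowCitizens of the Senate and of the House"

# Option 2: replace punctuation with a space, then collapse repeated spaces
spaced <- trimws(gsub(" +", " ", gsub("[[:punct:]]+", " ", x)))
spaced
# "Fellow Citizens of the Senate and of the House"
```

Which option is preferable depends on whether hyphenated compounds should count as one term or two. `tm::removePunctuation` also accepts a `preserve_intra_word_dashes` argument for keeping such hyphens.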

3.2 Remove special characters

Next, we remove special characters. For this purpose we use gsub, which replaces each special character we specify with a space. We then check the document once more.
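
The substitution itself can be sketched on a plain character vector in base R. The helper name `to_space` and the sample string are hypothetical. `fixed = TRUE` is used here as an alternative to escaping, so that "|" is matched literally rather than as regex alternation.

```r
# Replace each listed special character with a single space
to_space <- function(x, chars = c("/", "@", "|")) {
  for (ch in chars) {
    x <- gsub(ch, " ", x, fixed = TRUE)  # fixed = TRUE: match the character literally
  }
  x
}

to_space("senate/house @ home|abroad")
# "senate house   home abroad"
```

When such a function is applied to a tm corpus, it is commonly wrapped in `tm::content_transformer()` and passed to `tm_map()`, so that each element stays a text document with its metadata instead of becoming a bare character vector.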

for (j in seq(docs)) {
    docs[[j]] <- gsub("/", " ", docs[[j]])
    docs[[j]] <- gsub("@", " ", docs[[j]])
    docs[[j]] <- gsub("\\|", " ", docs[[j]])
    docs[[j]] <- gsub("\u2028", " ", docs[[j]])  # U+2028 is the Unicode line separator (not an ASCII character); it did not convert cleanly, so it is replaced with a space.
}
writeLines(as.character(docs[1]))
## list(c("FellowCitizens of the Senate and of the House of Representatives", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the 14th day of the present month On the one hand I was summoned by my country whose voice I can never hear but with veneration and love from a retreat which I had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time On the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected All I dare hope is that if in executing this task I have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see APP note me and its consequences be judged by my country with some share of the partiality in which they originated", 
## "Such being the impressions under which I have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge In tendering this homage to the Great Author of every public and private good I assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage These reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed You will join with me I trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence", 
## "By the article establishing the executive department it is made the duty of the President to recommend to your consideration such measures as he shall judge necessary and expedient The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given It will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world I dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the American people", 
## "Besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them Instead of undertaking particular recommendations on this subject in which I could be guided by no lights derived from official opportunities I shall again give way to my entire confidence in your discernment and pursuit of the public good for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted", 
## "To the foregoing observations I have one to add which will be most properly addressed to the House of Representatives It concerns myself and will therefore be as brief as possible When I was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which I contemplated my duty required that I should renounce every pecuniary compensation From this resolution I have in no instance departed and being still under the impressions which produced it I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require", 
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together I shall take my present leave but not without resorting once more to the benign Parent of the Human Race in humble supplication that since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so His divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this Government must depend"
## ))
## list()
## list()

3.3 Remove numbers

To make the text more uniform, we remove all digits from the documents with the removeNumbers function from the tm library. Only the digits themselves are deleted, so "14th" in the first speech becomes "th".

docs <- tm::tm_map(docs, removeNumbers)   
writeLines(as.character(docs[1])) 
## list(c("FellowCitizens of the Senate and of the House of Representatives", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the th day of the present month On the one hand I was summoned by my country whose voice I can never hear but with veneration and love from a retreat which I had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time On the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected All I dare hope is that if in executing this task I have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see APP note me and its consequences be judged by my country with some share of the partiality in which they originated", 
## "Such being the impressions under which I have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge In tendering this homage to the Great Author of every public and private good I assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage These reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed You will join with me I trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence", 
## "By the article establishing the executive department it is made the duty of the President to recommend to your consideration such measures as he shall judge necessary and expedient The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given It will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world I dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the American people", 
## "Besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them Instead of undertaking particular recommendations on this subject in which I could be guided by no lights derived from official opportunities I shall again give way to my entire confidence in your discernment and pursuit of the public good for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted", 
## "To the foregoing observations I have one to add which will be most properly addressed to the House of Representatives It concerns myself and will therefore be as brief as possible When I was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which I contemplated my duty required that I should renounce every pecuniary compensation From this resolution I have in no instance departed and being still under the impressions which produced it I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require", 
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together I shall take my present leave but not without resorting once more to the benign Parent of the Human Race in humble supplication that since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so His divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this Government must depend"
## ))
## list()
## list()
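
To see the transformation in isolation, the sketch below strips digits from one phrase of the first speech. It uses base R's gsub rather than tm itself, on the assumption that removeNumbers amounts to a plain digit removal; `strip_digits` is a hypothetical helper name, not part of tm.

```r
# Hypothetical helper: delete every run of digits, leaving other characters alone.
# Assumption: this mirrors the effect of tm::removeNumbers on a character vector.
strip_digits <- function(text) {
  gsub("[[:digit:]]+", "", text)
}

phrase  <- "received on the 14th day of the present month"
cleaned <- strip_digits(phrase)
print(cleaned)
```

Because only the digits are removed, the ordinal suffix "th" survives as a stray token, as in the output above.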

3.4 Convert to lowercase

Again for the sake of uniformity, we transform all uppercase letters to lowercase. Words like Book and book mean the same thing, but without lowercasing they are represented as two different terms in the vector space model, which adds dimensions.

Checking the first document below, we see that the first word of the speech, “fellowcitizens”, now starts with a lowercase letter.

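# tolower is a base R function rather than a tm transformation; it returns plain character vectors
docs <- tm::tm_map(docs, tolower)
# Wrap each element in a tm PlainTextDocument so later tm steps receive documents
docs <- tm::tm_map(docs, PlainTextDocument)
# Keep a copy of the corpus at this stage
DocsCopy <- docs
writeLines(as.character(docs[1])) 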
## list(list(content = c("fellowcitizens of the senate and of the house of representatives", "among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the th day of the present month on the one hand i was summoned by my country whose voice i can never hear but with veneration and love from a retreat which i had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time on the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies in this conflict of emotions all i dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected all i dare hope is that if in executing this task i have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see app note me and its consequences be judged by my country with some share of the partiality in which they originated", 
## "such being the impressions under which i have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that almighty being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that his benediction may consecrate to the liberties and happiness of the people of the united states a government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge in tendering this homage to the great author of every public and private good i assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either no people can be bound to acknowledge and adore the invisible hand which conducts the affairs of men more than those of the united states every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage these reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed you will join with me i trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence", 
## "by the article establishing the executive department it is made the duty of the president to recommend to your consideration such measures as he shall judge necessary and expedient the circumstances under which i now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given it will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them in these honorable qualifications i behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world i dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of heaven can never be expected on a nation that disregards the eternal rules of order and right which heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the american people", 
## "besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them instead of undertaking particular recommendations on this subject in which i could be guided by no lights derived from official opportunities i shall again give way to my entire confidence in your discernment and pursuit of the public good for i assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted", 
## "to the foregoing observations i have one to add which will be most properly addressed to the house of representatives it concerns myself and will therefore be as brief as possible when i was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which i contemplated my duty required that i should renounce every pecuniary compensation from this resolution i have in no instance departed and being still under the impressions which produced it i must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which i am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require", 
## "having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together i shall take my present leave but not without resorting once more to the benign parent of the human race in humble supplication that since he has been pleased to favor the american people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so his divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this government must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 18.4179799556732, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
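
The effect on the vector space model can be checked on a toy example by counting distinct terms before and after lowercasing. The token list below is made up for illustration and is not taken from the speeches.

```r
# Toy token list with case variants of the same words (illustrative only)
tokens <- c("Book", "book", "BOOK", "Government", "government")

# Distinct terms before and after lowercasing
n_before <- length(unique(tokens))
n_after  <- length(unique(tolower(tokens)))

cat("distinct terms before:", n_before, "\n")
cat("distinct terms after: ", n_after, "\n")
```

Each distinct term becomes one dimension, so merging case variants shrinks the term space. On a tm corpus, an alternative that was not run for this report is to wrap the function as tm_map(docs, content_transformer(tolower)), which should make the separate PlainTextDocument step unnecessary.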

3.5 Remove “stopwords”

Stop words such as "the", "of" and "and" are abundant in any human language but carry little information on their own. By removing them, we focus the analysis on the more informative words.

# Number of entries in tm's English stopword list; run stopwords("english") to print them
length(stopwords("english"))   
## [1] 174
docs <- tm::tm_map(docs, removeWords, stopwords("english"))   
docs <- tm::tm_map(docs, PlainTextDocument)
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizens   senate    house  representatives", "among  vicissitudes incident  life  event   filled   greater anxieties      notification  transmitted   order  received   th day   present month   one hand   summoned   country whose voice  can never hear   veneration  love   retreat    chosen   fondest predilection    flattering hopes   immutable decision   asylum   declining years— retreat   rendered every day  necessary  well   dear     addition  habit  inclination   frequent interruptions   health   gradual waste committed    time    hand  magnitude  difficulty   trust    voice   country called   sufficient  awaken   wisest   experienced   citizens  distrustful scrutiny   qualifications    overwhelm  despondence one  inheriting inferior endowments  nature  unpracticed   duties  civil administration    peculiarly conscious    deficiencies   conflict  emotions   dare aver       faithful study  collect  duty   just appreciation  every circumstance    might  affected   dare hope     executing  task     much swayed   grateful remembrance  former instances    affectionate sensibility   transcendent proof   confidence   fellowcitizens   thence  little consulted  incapacity  well  disinclination   weighty  untried cares    error will  palliated   motives  mislead see app note    consequences  judged   country   share   partiality    originated", 
## "   impressions      obedience   public summons repaired   present station    peculiarly improper  omit   first official act  fervent supplications   almighty   rules   universe  presides   councils  nations  whose providential aids can supply every human defect   benediction may consecrate   liberties  happiness   people   united states  government instituted     essential purposes  may enable every instrument employed   administration  execute  success  functions allotted   charge  tendering  homage   great author  every public  private good  assure    expresses  sentiments  less        fellowcitizens  large less  either  people can  bound  acknowledge  adore  invisible hand  conducts  affairs  men      united states every step     advanced   character   independent nation seems    distinguished   token  providential agency    important revolution just accomplished   system   united government  tranquil deliberations  voluntary consent   many distinct communities    event  resulted can   compared   means    governments   established without  return  pious gratitude along   humble anticipation   future blessings   past seem  presage  reflections arising    present crisis  forced   strongly   mind   suppressed  will join    trust  thinking    none   influence    proceedings   new  free government can  auspiciously commence", 
## "  article establishing  executive department   made  duty   president  recommend   consideration  measures   shall judge necessary  expedient  circumstances    now meet  will acquit   entering   subject    refer   great constitutional charter     assembled    defining  powers designates  objects    attention    given  will   consistent   circumstances  far  congenial   feelings  actuate   substitute  place   recommendation  particular measures  tribute   due   talents  rectitude   patriotism  adorn  characters selected  devise  adopt    honorable qualifications  behold  surest pledges    one side  local prejudices  attachments  separate views  party animosities will misdirect  comprehensive  equal eye    watch   great assemblage  communities  interests   another   foundation   national policy will  laid   pure  immutable principles  private morality   preeminence  free government  exemplified    attributes  can win  affections   citizens  command  respect   world  dwell   prospect  every satisfaction   ardent love   country can inspire since    truth  thoroughly established    exists   economy  course  nature  indissoluble union  virtue  happiness  duty  advantage   genuine maxims   honest  magnanimous policy   solid rewards  public prosperity  felicity since      less persuaded   propitious smiles  heaven can never  expected   nation  disregards  eternal rules  order  right  heaven   ordained  since  preservation   sacred fire  liberty   destiny   republican model  government  justly considered perhaps  deeply  finally staked   experiment entrusted   hands   american people", 
## "besides  ordinary objects submitted   care  will remain   judgment  decide  far  exercise   occasional power delegated   fifth article   constitution  rendered expedient   present juncture   nature  objections    urged   system    degree  inquietude   given birth   instead  undertaking particular recommendations   subject      guided   lights derived  official opportunities  shall  give way   entire confidence   discernment  pursuit   public good   assure   whilst  carefully avoid every alteration  might endanger  benefits   united  effective government     await  future lessons  experience  reverence   characteristic rights  freemen   regard   public harmony will sufficiently influence  deliberations   question  far  former can  impregnably fortified   latter  safely  advantageously promoted", 
## "  foregoing observations   one  add  will   properly addressed   house  representatives  concerns   will therefore   brief  possible    first honored   call   service   country    eve   arduous struggle   liberties  light    contemplated  duty required    renounce every pecuniary compensation   resolution     instance departed   still   impressions  produced   must decline  inapplicable    share   personal emoluments  may  indispensably included   permanent provision   executive department  must accordingly pray   pecuniary estimates   station     placed may   continuance    limited   actual expenditures   public good may  thought  require", 
## " thus imparted    sentiments     awakened   occasion  brings us together  shall take  present leave   without resorting     benign parent   human race  humble supplication  since    pleased  favor  american people  opportunities  deliberating  perfect tranquillity  dispositions  deciding  unparalleled unanimity   form  government   security   union   advancement   happiness   divine blessing may  equally conspicuous   enlarged views  temperate consultations   wise measures    success   government must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 18.6481308937073, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
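
A minimal sketch of whole-word removal, using base R only. The three-word `stop_list` is a hypothetical stand-in for the 174-word list above, and the word-boundary pattern is an assumption about how such removal can be done, not a copy of tm's implementation.

```r
# Hypothetical mini stopword list; tm's stopwords("english") is much longer
stop_list <- c("of", "the", "and")

# Build one pattern that matches any listed word as a whole word
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")

phrase  <- "fellowcitizens of the senate and of the house of representatives"
removed <- gsub(pattern, "", phrase, perl = TRUE)

# The deleted words leave their surrounding spaces behind
print(removed)

# Collapsing the gaps gives the remaining content words
squeezed <- gsub("[[:space:]]+", " ", removed)
print(squeezed)
```

The word boundaries matter: without them, "the" would also be cut out of longer words such as "thence". The leftover gaps in `removed` match the runs of spaces visible in the output above.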

3.6 Remove particular stopwords

# Example: remove only the words "syllogism" and "tautology".
# These words do not occur in these texts, but this is how you would remove them if they did.
# docs <- tm::tm_map(docs, removeWords, c("syllogism", "tautology"))

3.7 Retain compound words

If you wish to preserve a concept that only emerges from a sequence of two or more words, you can join those words or reduce them to a meaningful acronym before you begin the analysis. Here, we are using examples that are particular to qualitative data analysis.

for (j in seq(docs))
{
  docs[[j]] <- gsub("fake news", "fake_news", docs[[j]])
  docs[[j]] <- gsub("inner city", "inner-city", docs[[j]])
  docs[[j]] <- gsub("politically correct", "politically_correct", docs[[j]])
}
docs <- tm_map(docs, PlainTextDocument)
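
The same idea can be tried on a single string. The sketch below reuses the phrase mapping from the loop above; `join_compounds` and the sample sentence are hypothetical and only serve to show that each joined phrase stays together as one token after splitting.

```r
# Phrase-to-token mapping taken from the loop above
compounds <- c(
  "fake news"           = "fake_news",
  "inner city"          = "inner-city",
  "politically correct" = "politically_correct"
)

# Hypothetical helper: replace each phrase with its single-token form
join_compounds <- function(text, mapping) {
  for (phrase in names(mapping)) {
    text <- gsub(phrase, mapping[[phrase]], text, fixed = TRUE)
  }
  text
}

sentence <- "fake news spreads in the inner city"
joined   <- join_compounds(sentence, compounds)
print(joined)

# Splitting on spaces now keeps each compound together as one token
tokens <- strsplit(joined, " ", fixed = TRUE)[[1]]
print(length(tokens))
```

Keeping the mapping in one named vector makes it easy to add phrases without touching the replacement logic.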

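3.8 Strip unnecessary white space

The previous steps delete words and characters but leave the surrounding spaces in place, so the text now contains runs of blanks. The stripWhitespace transformation collapses each run into a single space.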

docs <- tm_map(docs, stripWhitespace)
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizens senate house representatives", "among vicissitudes incident life event filled greater anxieties notification transmitted order received th day present month one hand summoned country whose voice can never hear veneration love retreat chosen fondest predilection flattering hopes immutable decision asylum declining years— retreat rendered every day necessary well dear addition habit inclination frequent interruptions health gradual waste committed time hand magnitude difficulty trust voice country called sufficient awaken wisest experienced citizens distrustful scrutiny qualifications overwhelm despondence one inheriting inferior endowments nature unpracticed duties civil administration peculiarly conscious deficiencies conflict emotions dare aver faithful study collect duty just appreciation every circumstance might affected dare hope executing task much swayed grateful remembrance former instances affectionate sensibility transcendent proof confidence fellowcitizens thence little consulted incapacity well disinclination weighty untried cares error will palliated motives mislead see app note consequences judged country share partiality originated", 
## " impressions obedience public summons repaired present station peculiarly improper omit first official act fervent supplications almighty rules universe presides councils nations whose providential aids can supply every human defect benediction may consecrate liberties happiness people united states government instituted essential purposes may enable every instrument employed administration execute success functions allotted charge tendering homage great author every public private good assure expresses sentiments less fellowcitizens large less either people can bound acknowledge adore invisible hand conducts affairs men united states every step advanced character independent nation seems distinguished token providential agency important revolution just accomplished system united government tranquil deliberations voluntary consent many distinct communities event resulted can compared means governments established without return pious gratitude along humble anticipation future blessings past seem presage reflections arising present crisis forced strongly mind suppressed will join trust thinking none influence proceedings new free government can auspiciously commence", 
## " article establishing executive department made duty president recommend consideration measures shall judge necessary expedient circumstances now meet will acquit entering subject refer great constitutional charter assembled defining powers designates objects attention given will consistent circumstances far congenial feelings actuate substitute place recommendation particular measures tribute due talents rectitude patriotism adorn characters selected devise adopt honorable qualifications behold surest pledges one side local prejudices attachments separate views party animosities will misdirect comprehensive equal eye watch great assemblage communities interests another foundation national policy will laid pure immutable principles private morality preeminence free government exemplified attributes can win affections citizens command respect world dwell prospect every satisfaction ardent love country can inspire since truth thoroughly established exists economy course nature indissoluble union virtue happiness duty advantage genuine maxims honest magnanimous policy solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expected nation disregards eternal rules order right heaven ordained since preservation sacred fire liberty destiny republican model government justly considered perhaps deeply finally staked experiment entrusted hands american people", 
## "besides ordinary objects submitted care will remain judgment decide far exercise occasional power delegated fifth article constitution rendered expedient present juncture nature objections urged system degree inquietude given birth instead undertaking particular recommendations subject guided lights derived official opportunities shall give way entire confidence discernment pursuit public good assure whilst carefully avoid every alteration might endanger benefits united effective government await future lessons experience reverence characteristic rights freemen regard public harmony will sufficiently influence deliberations question far former can impregnably fortified latter safely advantageously promoted", 
## " foregoing observations one add will properly addressed house representatives concerns will therefore brief possible first honored call service country eve arduous struggle liberties light contemplated duty required renounce every pecuniary compensation resolution instance departed still impressions produced must decline inapplicable share personal emoluments may indispensably included permanent provision executive department must accordingly pray pecuniary estimates station placed may continuance limited actual expenditures public good may thought require", 
## " thus imparted sentiments awakened occasion brings us together shall take present leave without resorting benign parent human race humble supplication since pleased favor american people opportunities deliberating perfect tranquillity dispositions deciding unparalleled unanimity form government security union advancement happiness divine blessing may equally conspicuous enlarged views temperate consultations wise measures success government must depend"), meta = list(author = character(0), datetimestamp = list(
##     sec = 18.6965968608856, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
docs <- tm_map(docs, PlainTextDocument)

3.9 Stemming

The stemDocument function from the tm package performs stemming on the documents. However, it does not perform all the needed transformations correctly. To resolve this, we first store a copy of the corpus and then apply stemDocument.

dictCorpus <- docs
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizen senat hous repres", "among vicissitud incid life event fill greater anxieti notif transmit order receiv th day present month one hand summon countri whose voic can never hear vener love retreat chosen fondest predilect flatter hope immut decis asylum declin years— retreat render everi day necessari well dear addit habit inclin frequent interrupt health gradual wast commit time hand magnitud difficulti trust voic countri call suffici awaken wisest experienc citizen distrust scrutini qualif overwhelm despond one inherit inferior endow natur unpract duti civil administr peculiar conscious defici conflict emot dare aver faith studi collect duti just appreci everi circumst might affect dare hope execut task much sway grate remembr former instanc affection sensibl transcend proof confid fellowcitizen thenc littl consult incapac well disinclin weighti untri care error will palliat motiv mislead see app note consequ judg countri share partial origin", 
## "impress obedi public summon repair present station peculiar improp omit first offici act fervent supplic almighti rule univers presid council nation whose providenti aid can suppli everi human defect benedict may consecr liberti happi peopl unit state govern institut essenti purpos may enabl everi instrument employ administr execut success function allot charg tender homag great author everi public privat good assur express sentiment less fellowcitizen larg less either peopl can bound acknowledg ador invis hand conduct affair men unit state everi step advanc charact independ nation seem distinguish token providenti agenc import revolut just accomplish system unit govern tranquil deliber voluntari consent mani distinct communiti event result can compar mean govern establish without return pious gratitud along humbl anticip futur bless past seem presag reflect aris present crisi forc strong mind suppress will join trust think none influenc proceed new free govern can auspici commenc", 
## "articl establish execut depart made duti presid recommend consider measur shall judg necessari expedi circumst now meet will acquit enter subject refer great constitut charter assembl defin power design object attent given will consist circumst far congeni feel actuat substitut place recommend particular measur tribut due talent rectitud patriot adorn charact select devis adopt honor qualif behold surest pledg one side local prejudic attach separ view parti animos will misdirect comprehens equal eye watch great assemblag communiti interest anoth foundat nation polici will laid pure immut principl privat moral preemin free govern exemplifi attribut can win affect citizen command respect world dwell prospect everi satisfact ardent love countri can inspir sinc truth thorough establish exist economi cours natur indissolubl union virtu happi duti advantag genuin maxim honest magnanim polici solid reward public prosper felic sinc less persuad propiti smile heaven can never expect nation disregard etern rule order right heaven ordain sinc preserv sacr fire liberti destini republican model govern just consid perhap deepli final stake experi entrust hand american peopl", 
## "besid ordinari object submit care will remain judgment decid far exercis occasion power deleg fifth articl constitut render expedi present junctur natur object urg system degre inquietud given birth instead undertak particular recommend subject guid light deriv offici opportun shall give way entir confid discern pursuit public good assur whilst care avoid everi alter might endang benefit unit effect govern await futur lesson experi rever characterist right freemen regard public harmoni will suffici influenc deliber question far former can impregn fortifi latter safe advantag promot", 
## "forego observ one add will proper address hous repres concern will therefor brief possibl first honor call servic countri eve arduous struggl liberti light contempl duti requir renounc everi pecuniari compens resolut instanc depart still impress produc must declin inapplic share person emolu may indispens includ perman provis execut depart must accord pray pecuniari estim station place may continu limit actual expenditur public good may thought requir", "thus impart sentiment awaken occas bring us togeth shall take present leav without resort benign parent human race humbl supplic sinc pleas favor american peopl opportun deliber perfect tranquil disposit decid unparallel unanim form govern secur union advanc happi divin bless may equal conspicu enlarg view temper consult wise measur success govern must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 18.7270286083221, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()

Secondly, stemCompletion from tm can be used to correct the wrong transformations and complete the stems by referring to the copied corpus (also called the dictionary). However, the original stemCompletion function replaces empty strings with spurious words that never existed in the text. To avoid this, we defined a modified version of stemCompletion.

stemCompletion_mod <- function(x, dictionary) {
   x <- unlist(strsplit(as.character(x), " "))
   x <- x[x != ""]
   x <- stemCompletion(x, dictionary=dictionary)
   x <- paste(x, sep="", collapse=" ")
   PlainTextDocument(stripWhitespace(x))
 }
docs <- lapply(docs, stemCompletion_mod, dictionary=dictCorpus)
docs <- Corpus(VectorSource(docs))
writeLines(as.character(docs[1]))
## list(content = "fellowcitizens senate house representatives among vicissitudes incident life events fill greater anxieties notification transmit order receive things day present months one hand summoned countries whose voice can never heart veneration love retreat chosen fondest predilection flattered hope immutable decisions asylum decline years— retreat render everincreasing day necessarily well dear additional habits inclination frequent interrupted health gradually waste committed time hand magnitude difficulties trust voice countries called sufficient awakened wisest experience citizens distrust scrutinize qualifications overwhelming despondence one inheritance inferior endowed nature unpracticed duties civil administration peculiar consciousness deficit conflict emotions dare avert faith studied collected duties justice appreciation everincreasing circumstances might affecting dare hope executive task much swayed grateful remembrance former instance affection sensible transcendent proof confidence fellowcitizens thence little consultations incapacity well disinclination weightiest untried care error will palliated motives mislead see appear note consequences judgment countries share partial original impressed obedience public summoned repair present station peculiar improper omit first official action fervent supplications rule universal president councils nation whose providential aid can supplications everincreasing human defects benediction may consecrate liberties happiness people united states government institutions essential purpose may enable everincreasing instrument employed administration executive success functions allotted charged tender homage great authority everincreasing public private good assured expression sentiment less fellowcitizens large less either people can bound acknowledged adore invisible hand conduct affairs men united states everincreasing steps advance character independence nation seem distinguished token providential 
agencies important revolution justice accomplished system united government tranquillity deliberate voluntarily consent manifest distinction communities events result can comparative means government established without return pious gratitude along humble anticipated future blessings past seem presage reflect arise present crisis force strong mind suppression will join trust think none influence proceed new free government can auspicious commencement articles established executive departments made duties president recommend consideration measures shall judgment necessarily expedient circumstances now meet will acquit enterprise subject reference great constitution charter assembled define power designed object attention given will consistent circumstances far congenial feel actuated substitute place recommend particular measures tribute due talents rectitude patriotism adorn character selected devised adoption honor qualifications behold surest pledge one side local prejudice attachment separate view parties animosities will misdirect comprehensive equal eyes watching great assemblage communities interests another foundations nation policies will laid pure immutable principles private moral preeminent free government exemplified attributes can win affecting citizens command respect world dwell prospect everincreasing satisfaction ardent love countries can inspire since truth thorough established existence economic course nature indissoluble union virtue happiness duties advantage genuine maxim honest magnanimity policies solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expect nation disregard eternal rule order rights heaven ordained since preserve sacred fire liberties destinies republican model government justice consideration perhaps finally stake experience entrusted hand american people besides object submit care will remain judgment decide far exercise occasion power delegated fifth articles constitution render 
expedient present juncture nature object urge system degree inquietude given birth instead undertake particular recommend subject guidance light derived official opportunity shall give way entire confidence discern pursuit public good assured whilst care avoid everincreasing altered might endanger benefits united effect government await future lesson experience reverence characteristic rights freemen regard public harmonious will sufficient influence deliberate question far former can impregnable fortifications latter safety advantage promote forego observe one add will proper address house representatives concern will therefore brief possible first honor called service countries every arduous struggle liberties light contemplate duties require renounce everincreasing compensation resolution instance departments still impressed produce must decline inapplicable share personal emoluments may indispensable including permanent provision executive departments must according prayer estimate station place may continue limits actual expenditures public good may thought require thus impartial sentiment awakened occasion bring us together shall take present leave without resort benign parent human race humble supplications since pleasing favor american people opportunity deliberate perfect tranquillity disposition decide unparalleled unanimity form government secure union advance happiness divine blessings may equal conspicuous enlarged view temper consultations wise measures success government must depend", 
##     meta = list(author = character(0), datetimestamp = list(sec = 52.9806113243103, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0)))
## list(language = "en")
## list()
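
To make the completion step more concrete, here is a minimal base R sketch of prefix-based stem completion. It is not tm's implementation: the helper complete_stems_sketch and the toy dictionary are hypothetical, and the sketch assumes that a stem is completed with the most frequent dictionary word that starts with it.

```r
# Hypothetical helper: complete each stem with the most frequent
# dictionary word that starts with it. Empty stems stay empty and
# stems without any candidate are returned unchanged.
complete_stems_sketch <- function(stems, dictionary_words) {
  counts <- table(dictionary_words)
  vapply(stems, function(stem) {
    if (stem == "") return("")            # never invent a word for an empty stem
    hits <- counts[startsWith(names(counts), stem)]
    if (length(hits) == 0) return(stem)   # no candidate: keep the stem
    names(hits)[which.max(hits)]
  }, character(1), USE.NAMES = FALSE)
}

# Toy dictionary (assumption, for illustration only)
dictionary_words <- c("every", "every", "everincreasing",
                      "countries", "country",
                      "government", "government", "governments")

completed <- complete_stems_sketch(c("everi", "countri", "govern", "zzz"),
                                   dictionary_words)
print(completed)
```

Under this assumption the stem everi cannot be completed to every, because it is not a prefix of that word, so it is matched to a longer word such as everincreasing. This may explain some of the unexpected words in the output above.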

3.10 Type Check

Once preprocessing is complete, run the following script. It strips any remaining extra whitespace and prints the first document, so we can check that the corpus still holds the processed text.

docs <- tm::tm_map(docs, stripWhitespace)
## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents
writeLines(as.character(docs[1]))
## list(content = "fellowcitizens senate house representatives among vicissitudes incident life events fill greater anxieties notification transmit order receive things day present months one hand summoned countries whose voice can never heart veneration love retreat chosen fondest predilection flattered hope immutable decisions asylum decline years— retreat render everincreasing day necessarily well dear additional habits inclination frequent interrupted health gradually waste committed time hand magnitude difficulties trust voice countries called sufficient awakened wisest experience citizens distrust scrutinize qualifications overwhelming despondence one inheritance inferior endowed nature unpracticed duties civil administration peculiar consciousness deficit conflict emotions dare avert faith studied collected duties justice appreciation everincreasing circumstances might affecting dare hope executive task much swayed grateful remembrance former instance affection sensible transcendent proof confidence fellowcitizens thence little consultations incapacity well disinclination weightiest untried care error will palliated motives mislead see appear note consequences judgment countries share partial original impressed obedience public summoned repair present station peculiar improper omit first official action fervent supplications rule universal president councils nation whose providential aid can supplications everincreasing human defects benediction may consecrate liberties happiness people united states government institutions essential purpose may enable everincreasing instrument employed administration executive success functions allotted charged tender homage great authority everincreasing public private good assured expression sentiment less fellowcitizens large less either people can bound acknowledged adore invisible hand conduct affairs men united states everincreasing steps advance character independence nation seem distinguished token providential 
agencies important revolution justice accomplished system united government tranquillity deliberate voluntarily consent manifest distinction communities events result can comparative means government established without return pious gratitude along humble anticipated future blessings past seem presage reflect arise present crisis force strong mind suppression will join trust think none influence proceed new free government can auspicious commencement articles established executive departments made duties president recommend consideration measures shall judgment necessarily expedient circumstances now meet will acquit enterprise subject reference great constitution charter assembled define power designed object attention given will consistent circumstances far congenial feel actuated substitute place recommend particular measures tribute due talents rectitude patriotism adorn character selected devised adoption honor qualifications behold surest pledge one side local prejudice attachment separate view parties animosities will misdirect comprehensive equal eyes watching great assemblage communities interests another foundations nation policies will laid pure immutable principles private moral preeminent free government exemplified attributes can win affecting citizens command respect world dwell prospect everincreasing satisfaction ardent love countries can inspire since truth thorough established existence economic course nature indissoluble union virtue happiness duties advantage genuine maxim honest magnanimity policies solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expect nation disregard eternal rule order rights heaven ordained since preserve sacred fire liberties destinies republican model government justice consideration perhaps finally stake experience entrusted hand american people besides object submit care will remain judgment decide far exercise occasion power delegated fifth articles constitution render 
expedient present juncture nature object urge system degree inquietude given birth instead undertake particular recommend subject guidance light derived official opportunity shall give way entire confidence discern pursuit public good assured whilst care avoid everincreasing altered might endanger benefits united effect government await future lesson experience reverence characteristic rights freemen regard public harmonious will sufficient influence deliberate question far former can impregnable fortifications latter safety advantage promote forego observe one add will proper address house representatives concern will therefore brief possible first honor called service countries every arduous struggle liberties light contemplate duties require renounce everincreasing compensation resolution instance departments still impressed produce must decline inapplicable share personal emoluments may indispensable including permanent provision executive departments must according prayer estimate station place may continue limits actual expenditures public good may thought require thus impartial sentiment awakened occasion bring us together shall take present leave without resort benign parent human race humble supplications since pleasing favor american people opportunity deliberate perfect tranquillity disposition decide unparalleled unanimity form government secure union advance happiness divine blessings may equal conspicuous enlarged view temper consultations wise measures success government must depend", meta = list(author = character(0), datetimestamp = list(sec = 52.9806113243103, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0)))
## list(language = "en")
## list()
nrow(df)
## [1] 59

In the code below, we save the preprocessed documents into another folder, because later we reuse them to measure document similarity with another library, textreuse.

#Elnaz

for (i in seq_len(nrow(df))) {       # loop over the rows of df
  df_i <- df[i, ]
  # Build a file name of the form "<year>-<president>.txt"
  file_name <- paste0(df_i$year, "-", df_i$president, ".txt")
  loc <- paste0("./data/pre_processed/", file_name)
  writeLines(as.character(docs[[i]]), loc)
}

3.11 Create Doc Term Matrix

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

dtm <- tm::DocumentTermMatrix(docs)   
dtm 
## <<DocumentTermMatrix (documents: 59, terms: 5352)>>
## Non-/sparse entries: 34813/280955
## Sparsity           : 89%
## Maximal term length: 23
## Weighting          : term frequency (tf)
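
To illustrate the layout described above, the following base R sketch builds a tiny document-term matrix by hand. The two toy documents and the names toy_docs and toy_dtm are made up for this example. It does not use tm, so it only mirrors the row and column convention, not the sparse storage.

```r
# Two toy documents (hypothetical, for illustration only)
toy_docs <- c(doc1 = "we the people", doc2 = "the people the nation")

# Split each document into tokens and collect the vocabulary
tokens <- strsplit(toy_docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document:
# rows are documents, columns are terms
toy_dtm <- t(vapply(tokens, function(tok) {
  as.integer(table(factor(tok, levels = vocab)))
}, integer(length(vocab))))
colnames(toy_dtm) <- vocab

print(toy_dtm)
print(colSums(toy_dtm))   # total frequency of each term
```

Summing the columns gives the total frequency of each term across the collection, which is the quantity used in the frequency analysis.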

We also store the transpose, the term-document matrix:

tdm <- tm::TermDocumentMatrix(docs)   
tdm  
## <<TermDocumentMatrix (terms: 5352, documents: 59)>>
## Non-/sparse entries: 34813/280955
## Sparsity           : 89%
## Maximal term length: 23
## Weighting          : term frequency (tf)

3.12 Organize by frequency

freq <- colSums(as.matrix(dtm))   
length(freq)   
## [1] 5352
ord <- order(freq)
m <- as.matrix(dtm)   
dim(m)  
## [1]   59 5352

Optionally, write the matrix to disk as a CSV file (commented out here):

#write.csv(m, file="DocumentTermMatrix.csv")   

3.13 Remove sparse words

We remove sparse terms using a 20% sparsity threshold. The resulting matrix keeps 57 terms and has a sparsity of 7%.

#  Start by removing sparse terms:   
dtms <- removeSparseTerms(dtm, 0.2) # This makes a matrix that is 20% empty space, maximum.   
dtms
## <<DocumentTermMatrix (documents: 59, terms: 57)>>
## Non-/sparse entries: 3117/246
## Sparsity           : 7%
## Maximal term length: 14
## Weighting          : term frequency (tf)
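
The filtering rule can be sketched in base R on a small dense matrix. The sketch assumes that a term's sparsity is the share of documents in which it does not occur and that only terms whose sparsity is below the threshold are kept. The matrix toy and the threshold of 0.5 are made up for illustration.

```r
# Toy document-term matrix: 5 documents (rows) x 3 terms (columns)
toy <- matrix(c(1, 2, 1, 3, 1,    # "will"   occurs in every document
                0, 0, 0, 0, 4,    # "ruin"   occurs in one document only
                2, 0, 1, 1, 1),   # "nation" is missing from one document
              nrow = 5,
              dimnames = list(paste0("doc", 1:5), c("will", "ruin", "nation")))

# Sparsity of a term = share of documents where its count is zero
term_sparsity <- colMeans(toy == 0)
print(term_sparsity)

# Keep only the terms whose sparsity is below the (assumed) threshold
threshold <- 0.5
filtered  <- toy[, term_sparsity < threshold, drop = FALSE]
print(colnames(filtered))
```

With the 59 speeches and a threshold of 0.2, the same rule means that a term has to occur in more than 59 × 0.8 = 47.2 documents, that is, in at least 48 speeches, to be kept.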

4 Term Similarity

We compute the frequency of each term by summing the columns of the document-term matrix.

freq <- colSums(as.matrix(dtm))

Least frequent

We print the head of the frequency table, which is sorted by increasing frequency. The first entries therefore correspond to frequency 1, the smallest possible value, and the tail holds the most frequent words.

head(table(freq), 20) 
## freq
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2062  680  372  259  205  145  131   90   87   73   84   63   68   54   51   45 
##   17   18   19   20 
##   39   43   31   36

The top number is the frequency with which words appear and the bottom number reflects how many words appear that frequently.

Most frequent:

tail(table(freq), 40) 
## freq
## 159 161 164 165 174 176 177 178 185 194 195 201 202 210 229 231 232 250 253 268 
##   2   1   1   1   1   1   1   1   2   1   1   1   1   1   2   2   1   1   1   1 
## 272 279 282 289 295 299 304 314 341 346 353 373 374 379 461 488 621 685 724 963 
##   1   2   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1
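
How to read such a table can be checked on a toy frequency vector. The vector freq_toy below is made up for illustration.

```r
# Toy term frequencies (hypothetical)
freq_toy <- c(will = 3, nation = 1, peace = 1, ruin = 1, power = 3, law = 2)

# Top row: a frequency value; bottom row: how many terms have that frequency
tab <- table(freq_toy)
print(tab)
```

Here three terms occur once, one term occurs twice and two terms occur three times, so the table has the entries 1, 2, 3 in the top row and 3, 1, 2 in the bottom row.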

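Below we sort the term frequencies from the full matrix dtm in decreasing order and print the 20 most frequent terms. The entry character(0), is not a word from the speeches; it appears to be document metadata that ended up in the text when the documents were converted with as.character, as the printed corpus above shows.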

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)     
freq |> head(20)
##           will     government         nation         people            can 
##            963            724            685            621            488 
##         states          great          power           upon           must 
##            461            379            374            374            373 
##      countries          world            may          shall everincreasing 
##            353            346            341            314            304 
##   constitution  character(0),        justice          peace            one 
##            299            295            289            282            279

Below we identify all terms that appear at least 50 times and show the first 20 of them.

findFreqTerms(dtm, lowfreq=50) |> head(20)
##  [1] "0),"            "123,"           "action"         "administration"
##  [5] "advance"        "aid"            "american"       "among"         
##  [9] "another"        "authority"      "blessings"      "bring"         
## [13] "called"         "can"            "care"           "character"     
## [17] "character(0)))" "character(0),"  "citizens"       "civil"

Another approach to perform the same task:

wf <- data.frame(word=names(freq), freq=freq)   
head(wf) 
##                  word freq
## will             will  963
## government government  724
## nation         nation  685
## people         people  621
## can               can  488
## states         states  461

4.1 Word Frequency Plot

Now we visualize the results to make them easier to interpret. Using ggplot2, we draw a bar plot of the words that appear more than 200 times, ordered by decreasing frequency along the x-axis. Because stem completion was applied after stemming, the words appear in completed form rather than as bare stems.

p <- ggplot(subset(wf, freq>200), aes(x = reorder(word, -freq), y = freq)) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x=element_text(angle=45, hjust=1))

p   

4.2 Relationships Between Terms

Here we compute the correlations between terms across documents. If the counts of two words rise and fall together perfectly from document to document, their correlation is 1. We report only associations with a correlation of at least 0.75:

tm::findAssocs(dtm, c("government" , "states"), corlimit=0.75)
## $government
## system 
##   0.78 
## 
## $states
##      portion constitution       duties       object    existence         ruin 
##         0.83         0.80         0.80         0.80         0.79         0.79 
##          may 
##         0.78
findAssocs(dtms, "government", corlimit=0.70) # specifying a correlation limit of 0.70
## $government
## states 
##   0.75
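
The idea behind these association scores can be reproduced with base R's cor on a small matrix. This is a sketch under the assumption that the score is the correlation between the count columns of two terms; the matrix toy and its counts are made up for illustration.

```r
# Toy counts for 4 documents (rows) and 3 terms (columns)
toy <- cbind(
  government = c(3, 0, 2, 5),
  system     = c(6, 0, 4, 10),   # always exactly twice the "government" count
  peace      = c(0, 4, 1, 0)     # frequent where "government" is rare
)

# Correlation of every term with "government", rounded to two digits
assoc <- cor(toy)["government", ]
assoc <- round(assoc[names(assoc) != "government"], 2)
print(assoc)

# Keep only the associations above a limit of 0.75
print(assoc[assoc >= 0.75])
```

Only system passes the limit, because its counts move in lockstep with those of government, while peace is negatively correlated.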

4.3 Word Clouds

Plot words that occur at least 20 times.

Colorized version:

In this part the word clouds are visualized. The bigger a word appears in the word cloud, the more frequent it is. Words are also colored according to their frequency.

set.seed(142)   
wordcloud::wordcloud(names(freq), freq, min.freq=20, scale=c(5, .1), colors=brewer.pal(6, "Dark2")) 
## Warning in wordcloud::wordcloud(names(freq), freq, min.freq = 20, scale = c(5, :
## government could not be fit on page. It will not be plotted.
## Warning in wordcloud::wordcloud(names(freq), freq, min.freq = 20, scale = c(5, :
## people could not be fit on page. It will not be plotted.
## Warning in wordcloud::wordcloud(names(freq), freq, min.freq = 20, scale = c(5, :
## everincreasing could not be fit on page. It will not be plotted.
## Warning in wordcloud::wordcloud(names(freq), freq, min.freq = 20, scale = c(5, :
## character(0), could not be fit on page. It will not be plotted.

Plot the 100 most frequent words.

We use the same plotting approach, so size and color have the same meaning as before.

set.seed(142)   
dark2 <- brewer.pal(6, "Dark2")   
wordcloud::wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)  
## Warning in wordcloud::wordcloud(names(freq), freq, max.words = 100, rot.per =
## 0.2, : government could not be fit on page. It will not be plotted.

4.4 Clustering by Term Similarity

4.4.1 Hierarchical Clustering

To perform hierarchical clustering, we first compute the distance between words, using the Euclidean norm, and then cluster based on those distances.

d <- dist(t(dtms), method="euclidean")   
fit <- hclust(d=d, method="complete")   # for a different look try substituting: method="ward.D"
fit   
## 
## Call:
## hclust(d = d, method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 57

A dendrogram is the plot used to visualize hierarchical clustering. The lower the line joining two terms, the more similar they are, whereas higher lines indicate a larger distance between the clusters.

plot(fit, hang=-1)

And here the red boxes show the 6 clusters:

plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6)   # "k=" defines the number of clusters you are using   
rect.hclust(fit, k=6, border="red") # draw red borders around the 6 clusters of the dendrogram
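
The relation between merge heights and cluster membership can be checked on a toy example in base R. The four points below are made up; they form two tight pairs that lie far apart.

```r
# Four toy points: a and b are close, c and d are close,
# and the two pairs are far from each other
toy <- rbind(a = c(0, 0), b = c(0, 1), c = c(10, 10), d = c(10, 11))

d_toy   <- dist(toy, method = "euclidean")
fit_toy <- hclust(d_toy, method = "complete")

# Merge heights: two low merges (within the pairs), one high merge
print(fit_toy$height)

# Cutting the tree into k = 2 groups recovers the two pairs
groups_toy <- cutree(fit_toy, k = 2)
print(groups_toy)
```

The two pairs merge at height 1, while the final merge happens at the largest distance between the pairs, which is why it sits much higher in the dendrogram.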

4.4.2 K-means Clustering

To perform k-means clustering, we first compute the distances between words. For this purpose we used three different norms (“Euclidean”, “Manhattan”, “Maximum”) and then clustered based on each of them.

In what follows, there are clusplots for k-means clustering with the different norms. In a clusplot, each ellipse outlines one cluster, and the two axes are the first two principal components. At the bottom of each plot we can see the percentage of the point variability explained by these components.

Norm: Euclidean

d <- dist(t(dtms), method="euclidean")   
kfit <- kmeans(d, 2)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

d <- dist(t(dtms), method="euclidean")   
kfit <- kmeans(d, 4)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

Norm: Manhattan

d <- dist(t(dtms), method="manhattan")   
kfit <- kmeans(d, 4)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

d <- dist(t(dtms), method="manhattan")   
kfit <- kmeans(d, 2)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

Norm: Maximum

d <- dist(t(dtms), method="maximum")   
kfit <- kmeans(d, 4)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

d <- dist(t(dtms), method="maximum")   
kfit <- kmeans(d, 2)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
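
The following is a small sketch of what these calls do, under the assumption of an invented 4 x 4 document-term matrix (`toy_dtm`; all names in the block are hypothetical). Since `kmeans()` works on a numeric matrix, the distance object is used in its full square form, so each term is described by its row of distances to all terms. Fixing the seed and using several random starts (`nstart`) are choices made here only to keep the toy result reproducible.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for reproducibility

# Toy document-term matrix: 4 documents (rows) x 4 terms (columns); counts are invented
toy_dtm <- matrix(
  c(5, 4, 0, 0,
    6, 5, 1, 0,
    4, 5, 0, 1,
    5, 6, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("nation", "people", "tax", "war"))
)

# Manhattan distances between terms, expanded to a square matrix
toy_d <- dist(t(toy_dtm), method = "manhattan")
toy_d_mat <- as.matrix(toy_d)

# Each row of the square matrix is one observation for k-means
toy_kfit <- kmeans(toy_d_mat, centers = 2, nstart = 10)
print(toy_kfit$cluster)
```

The cluster numbers themselves are arbitrary; what matters is which terms share a number.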


5 Doc Similarity

This section uses the textreuse library.

We compare the documents pairwise and use the Jaccard similarity to measure how similar they are. This yields one score per pair of documents, as shown in the results:

#loc <- "./data/texts"
#docs <- tm::VCorpus(DirSource(loc)) 

loc <- "./data/pre_processed/"
corpus <- TextReuseCorpus(dir=loc)


comparisons <- pairwise_compare(corpus, jaccard_similarity)
compare_df <- pairwise_candidates(comparisons)
compare_df <- as.data.frame(compare_df, 
                            col.names = names(compare_df))
#compare_df <- compare_df[order(compare_df$score,decreasing=TRUE)]
compare_df <- compare_df[order(compare_df$score,decreasing=TRUE),]
compare_df |> head(3)
##                         a                          b      score
## 96 1793-George Washington 1945-Franklin D. Roosevelt 0.08333333
## 76 1793-George Washington       1865-Abraham Lincoln 0.07517084
## 86 1793-George Washington    1905-Theodore Roosevelt 0.06137184
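
To make the score concrete, here is a minimal base R sketch of the Jaccard similarity on two invented token vectors: the number of shared distinct tokens divided by the number of distinct tokens overall. The helper `jaccard_toy` and the two token vectors are hypothetical. They only illustrate the formula and are not the textreuse implementation, which works on the hashed n-grams of each document (the corpus reports `tokenize_ngrams` as its tokenizer).

```r
# Jaccard similarity of two token vectors: |A intersect B| / |A union B|
jaccard_toy <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Two invented token vectors (single words, for readability)
doc_a <- c("we", "the", "people", "of", "the", "united", "states")
doc_b <- c("the", "people", "of", "this", "nation")

# Shared tokens: "the", "people", "of" (3); distinct tokens overall: 8
score_ab <- jaccard_toy(doc_a, doc_b)
print(score_ab)  # 3 / 8 = 0.375
```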
corpus
## TextReuseCorpus
## Number of documents: 59 
## hash_func : hash_string 
## tokenizer : tokenize_ngrams
writeLines(as.character(corpus[1]))
## list(`1789-George Washington` = list(content = "list(content = \"fellowcitizens senate house representatives among vicissitudes incident life events fill greater anxieties notification transmit order receive things day present months one hand summoned countries whose voice can never heart veneration love retreat chosen fondest predilection flattered hope immutable decisions asylum decline years— retreat render everincreasing day necessarily well dear additional habits inclination frequent interrupted health gradually waste committed time hand magnitude difficulties trust voice countries called sufficient awakened wisest experience citizens distrust scrutinize qualifications overwhelming despondence one inheritance inferior endowed nature unpracticed duties civil administration peculiar consciousness deficit conflict emotions dare avert faith studied collected duties justice appreciation everincreasing circumstances might affecting dare hope executive task much swayed grateful remembrance former instance affection sensible transcendent proof confidence fellowcitizens thence little consultations incapacity well disinclination weightiest untried care error will palliated motives mislead see appear note consequences judgment countries share partial original impressed obedience public summoned repair present station peculiar improper omit first official action fervent supplications rule universal president councils nation whose providential aid can supplications everincreasing human defects benediction may consecrate liberties happiness people united states government institutions essential purpose may enable everincreasing instrument employed administration executive success functions allotted charged tender homage great authority everincreasing public private good assured expression sentiment less fellowcitizens large less either people can bound acknowledged adore invisible hand conduct affairs men united states everincreasing steps advance character independence 
nation seem distinguished token providential agencies important revolution justice accomplished system united government tranquillity deliberate voluntarily consent manifest distinction communities events result can comparative means government established without return pious gratitude along humble anticipated future blessings past seem presage reflect arise present crisis force strong mind suppression will join trust think none influence proceed new free government can auspicious commencement articles established executive departments made duties president recommend consideration measures shall judgment necessarily expedient circumstances now meet will acquit enterprise subject reference great constitution charter assembled define power designed object attention given will consistent circumstances far congenial feel actuated substitute place recommend particular measures tribute due talents rectitude patriotism adorn character selected devised adoption honor qualifications behold surest pledge one side local prejudice attachment separate view parties animosities will misdirect comprehensive equal eyes watching great assemblage communities interests another foundations nation policies will laid pure immutable principles private moral preeminent free government exemplified attributes can win affecting citizens command respect world dwell prospect everincreasing satisfaction ardent love countries can inspire since truth thorough established existence economic course nature indissoluble union virtue happiness duties advantage genuine maxim honest magnanimity policies solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expect nation disregard eternal rule order rights heaven ordained since preserve sacred fire liberties destinies republican model government justice consideration perhaps finally stake experience entrusted hand american people besides object submit care will remain judgment decide far exercise occasion power 
delegated fifth articles constitution render expedient present juncture nature object urge system degree inquietude given birth instead undertake particular recommend subject guidance light derived official opportunity shall give way entire confidence discern pursuit public good assured whilst care avoid everincreasing altered might endanger benefits united effect government await future lesson experience reverence characteristic rights freemen regard public harmonious will sufficient influence deliberate question far former can impregnable fortifications latter safety advantage promote forego observe one add will proper address house representatives concern will therefore brief possible first honor called service countries every arduous struggle liberties light contemplate duties require renounce everincreasing compensation resolution instance departments still impressed produce must decline inapplicable share personal emoluments may indispensable including permanent provision executive departments must according prayer estimate station place may continue limits actual expenditures public good may thought require thus impartial sentiment awakened occasion bring us together shall take present leave without resort benign parent human race humble supplications since pleasing favor american people opportunity deliberate perfect tranquillity disposition decide unparalleled unanimity form government secure union advance happiness divine blessings may equal conspicuous enlarged view temper consultations wise measures success government must depend\", meta = list(author = character(0), datetimestamp = list(sec = 52.9806113243103, min = 14, hour = 18, mday = 7, mon = 0, year = 123, wday = 6, yday = 6, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0)))", 
##     tokens = NULL, hashes = c(-79277512, 1461849276, 587687171, ...), minhashes = NULL, meta = list(file = "./data/pre_processed//1789-George Washington.txt", hash_func = "hash_string", id = "1789-George Washington", minhash_func = NULL, tokenizer = "tokenize_ngrams")))
## list(hash_func = "hash_string", tokenizer = "tokenize_ngrams")

5.1 Similarity Score Plot

Similarity Measure: Jaccard Similarity

Our goal now is to visualize the similarities. For this purpose, we build a 3D plot with one speech on the x-axis, the other speech on the y-axis, and the similarity score on the z-axis. To keep the plot readable, we show only the 30 most similar pairs and label each speech with its year and the president's initials. Since these speeches were given by some of the best-known American presidents, the initials remain easy to recognize.

Each pair is drawn as a ball whose color and size encode its score, so pairs with similar scores have similar colors.

# Choosing only the first 30 rows because otherwise the plot becomes unreadable since there are too many points
compare_df_viz <- compare_df[1:30, ]
# Converting names to initials
compare_df_viz$a <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_viz$a ,perl = TRUE)

compare_df_viz$b <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_viz$b ,perl = TRUE)
fig <- plot_ly(compare_df_viz, x = ~a, y = ~b, z = ~score, color=~score, size=~score)
fig <- fig |> add_markers()
fig <- fig |> layout(scene = list(xaxis = list(title = 'Doc1'),
                     yaxis = list(title = 'Doc2'),
                     zaxis = list(title = 'Similarity Score')
                     ))

fig
## Warning: `line.width` does not currently support multiple values.
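
The regular expression used for the labels can be checked on its own. The sketch below wraps it in a helper, `to_initials`, which is a hypothetical name introduced here for illustration. The pattern deletes every run of non-uppercase characters that directly follows an uppercase letter, so the year and the hyphen, which are not preceded by an uppercase letter, are kept.

```r
# Keep the year prefix and reduce each name to its initials
to_initials <- function(x) {
  gsub("(?<=[A-Z])[^A-Z]+", "", x, perl = TRUE)
}

doc_ids <- c("1793-George Washington", "1945-Franklin D. Roosevelt")
short_ids <- to_initials(doc_ids)
print(short_ids)  # "1793-GW" "1945-FDR"
```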

Similarity Measure: Ratio of Matches

Here we use another similarity measure, the ratio of matches, in place of the Jaccard similarity. The procedure is the same, but the results differ: the ratio of matches yields higher similarity scores.

For this reason, this time we plot the 50 most similar pairs, with a smaller ball size.

loc <- "./data/pre_processed/"
corpus <- TextReuseCorpus(dir=loc)


comparisons_rom <- pairwise_compare(corpus, ratio_of_matches)
compare_df_rom <- pairwise_candidates(comparisons_rom)
compare_df_rom <- as.data.frame(compare_df_rom, 
                            col.names = names(compare_df_rom))
#compare_df <- compare_df[order(compare_df$score,decreasing=TRUE)]
compare_df_rom <- compare_df_rom[order(compare_df_rom$score,decreasing=TRUE),]
compare_df_rom |> head(3)
##                           a                          b     score
## 1    1789-George Washington     1793-George Washington 0.3529412
## 1392 1921-Warren G. Harding 1945-Franklin D. Roosevelt 0.1162791
## 1416   1925-Calvin Coolidge 1945-Franklin D. Roosevelt 0.1096346
compare_df_rom_viz <- compare_df_rom[1:50, ]
compare_df_rom_viz$a <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_rom_viz$a ,perl = TRUE)

compare_df_rom_viz$b <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_rom_viz$b ,perl = TRUE)
fig <- plot_ly(compare_df_rom_viz, x = ~a, y = ~b, z = ~score, color=~score, size=~score)
fig <- fig |> add_markers()
fig <- fig |> layout(scene = list(xaxis = list(title = 'Doc1'),
                     yaxis = list(title = 'Doc2'),
                     zaxis = list(title = 'Similarity Score')
                     ))

fig
## Warning: `line.width` does not currently support multiple values.
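
The higher scores can be illustrated with a small base R sketch. It assumes, following the textreuse documentation, that the ratio of matches for (a, b) is the share of items in b that also occur in a; the helpers `ratio_of_matches_toy` and `jaccard_toy` and the token vectors are hypothetical and only mimic the two measures on plain word vectors.

```r
# Share of the items in b that also occur in a (directional measure)
ratio_of_matches_toy <- function(a, b) {
  sum(b %in% a) / length(b)
}

# Jaccard similarity, for comparison
jaccard_toy <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Two invented token vectors without repeated words
doc_a <- c("we", "the", "people", "of", "united", "states")
doc_b <- c("the", "people", "of", "this", "nation")

rom_ab <- ratio_of_matches_toy(doc_a, doc_b)  # 3 of the 5 items in doc_b
rom_ba <- ratio_of_matches_toy(doc_b, doc_a)  # 3 of the 6 items in doc_a
jac_ab <- jaccard_toy(doc_a, doc_b)           # 3 shared of 8 distinct items
print(c(rom_ab, rom_ba, jac_ab))  # 0.6 0.5 0.375
```

The ratio divides only by the size of one document, while the Jaccard similarity divides by the size of the union, which is why the ratio is never the smaller of the two. Unlike the Jaccard similarity, it also depends on the order of the two documents.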

5.2 Extra Material: Distance Matrix

The textreuse library does not return a distance matrix. Its output is a data frame with three columns: two hold the document names and the third holds their similarity score (here, the Jaccard similarity). We therefore convert this data frame to matrix form by pivoting it as follows. Note that the entries are similarity scores; 1 - score would turn them into distances.

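# Pivot the long score table: one row per document b, one column per document a
distance_df <- compare_df |> pivot_wider(names_from=a, values_from=score)

# Pairs that do not appear in the score table come out as NA; set them to 0
distance_df <- replace(distance_df, is.na(distance_df), 0)

# Move the document names into the row names so that only the numeric scores are converted
distance_mat <- data.matrix(tibble::column_to_rownames(distance_df, "b"))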
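
As an alternative sketch in base R, a full symmetric matrix can be filled directly from the three-column table. The table `toy_pairs` and its scores are invented, and all names in the block are hypothetical. The example also shows the conversion from similarity to distance as 1 - score, which is one common convention and an assumption here.

```r
# Invented long table of pairwise similarity scores (one row per pair)
toy_pairs <- data.frame(
  a = c("doc1", "doc1", "doc2"),
  b = c("doc2", "doc3", "doc3"),
  score = c(0.30, 0.10, 0.20),
  stringsAsFactors = FALSE
)

# Square matrix with one row and one column per document
doc_names <- sort(unique(c(toy_pairs$a, toy_pairs$b)))
sim_mat <- matrix(0, nrow = length(doc_names), ncol = length(doc_names),
                  dimnames = list(doc_names, doc_names))

# Fill both triangles by indexing with (row name, column name) pairs
sim_mat[cbind(toy_pairs$a, toy_pairs$b)] <- toy_pairs$score
sim_mat[cbind(toy_pairs$b, toy_pairs$a)] <- toy_pairs$score
diag(sim_mat) <- 1  # every document is identical to itself

# Turn similarities into distances
dist_mat <- 1 - sim_mat
print(dist_mat)
```

A matrix in this form can be passed to `as.dist()` and then to `hclust()`, in the same way as the term distances earlier.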

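Moreover, the library does not provide a function to compute cosine similarity. Below we implement it with the slam package, first for a single pair and then for the whole matrix, and convert the similarities to cosine distances (1 - similarity). Because dtms stores documents in rows and terms in columns, crossprod_simple_triplet_matrix() and col_sums() compare term columns, which is why the resulting matrix is indexed by Terms. The document-by-document version would use tcrossprod_simple_triplet_matrix() and row_sums() instead.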
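
The computation can be sketched on a small dense matrix with base R, where `crossprod()` and `tcrossprod()` play the role of the slam functions. The matrix `toy_dtm` and its counts are invented, and all names in the block are hypothetical.

```r
# Toy document-term matrix: 3 documents (rows) x 3 terms (columns); counts are invented
toy_dtm <- matrix(
  c(1, 2, 0,   # column "freedom"
    2, 4, 0,   # column "liberty"
    0, 0, 3),  # column "tax"
  nrow = 3,
  dimnames = list(c("doc1", "doc2", "doc3"), c("freedom", "liberty", "tax"))
)

# Term-by-term cosine similarity: dot products of columns over products of column norms
term_norms <- sqrt(colSums(toy_dtm^2))
term_cos_sim <- crossprod(toy_dtm) / (term_norms %o% term_norms)
term_cos_dist <- 1 - term_cos_sim

# Document-by-document version: the same idea applied to rows
doc_norms <- sqrt(rowSums(toy_dtm^2))
doc_cos_sim <- tcrossprod(toy_dtm) / (doc_norms %o% doc_norms)
doc_cos_dist <- 1 - doc_cos_sim

print(round(term_cos_dist, 3))
print(round(doc_cos_dist, 3))
```

In this toy matrix "freedom" and "liberty" have proportional counts, so their cosine distance is 0, while "tax" never co-occurs with them and has distance 1 to both. A row or column that contains only zeros would lead to a division by zero and would need separate handling.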

# inspect the first term column of the document-term matrix
dtms[,1]
## <<DocumentTermMatrix (documents: 59, terms: 1)>>
## Non-/sparse entries: 59/0
## Sparsity           : 0%
## Maximal term length: 3
## Weighting          : term frequency (tf)
# cosine similarity between the first two terms (columns of dtms): dot product divided by the product of the column norms
cosine_sim <- crossprod_simple_triplet_matrix(dtms[,1], dtms[,2])/sqrt(col_sums(dtms[,1]^2) * col_sums(dtms[,2]^2))
# construct the term-by-term cosine distance matrix (1 - cosine similarity)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(dtms)/(sqrt(col_sums(dtms^2) %*% t(col_sums(dtms^2))))

cosine_dist_mat
##                 Terms
## Terms                  0),      123,    action  american    called       can
##   0),            0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   123,           0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   action         0.2736932 0.2736932 0.0000000 0.4512145 0.4068406 0.1914627
##   american       0.3051469 0.3051469 0.4512145 0.0000000 0.3411302 0.3682255
##   called         0.2325625 0.2325625 0.4068406 0.3411302 0.0000000 0.3587294
##   can            0.2105213 0.2105213 0.1914627 0.3682255 0.3587294 0.0000000
##   character(0))) 0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   character(0),  0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   citizens       0.3084235 0.3084235 0.2871853 0.3479934 0.3281584 0.2930308
##   countries      0.2694235 0.2694235 0.2346954 0.4966198 0.3884188 0.2823231
##   datetimestamp  0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   description    0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   everincreasing 0.2186974 0.2186974 0.2971282 0.3990526 0.3711941 0.3168069
##   faith          0.2556057 0.2556057 0.3568184 0.4530689 0.4431053 0.3254937
##   free           0.3642011 0.3642011 0.3844362 0.5361073 0.4881184 0.3957589
##   good           0.1812272 0.1812272 0.3545748 0.4199530 0.2901072 0.2494620
##   government     0.2439350 0.2439350 0.2525751 0.4381982 0.3674708 0.2201304
##   great          0.2615817 0.2615817 0.2603146 0.4910762 0.3971755 0.2730551
##   heading        0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   hope           0.2465767 0.2465767 0.3286828 0.3762422 0.2914391 0.2287876
##   hour           0.1069114 0.1069114 0.3092995 0.2709890 0.2861276 0.2498471
##   isdst          0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   justice        0.1696191 0.1696191 0.3186538 0.3691250 0.3346682 0.2202974
##   language       0.1062622 0.1062622 0.2260204 0.3824950 0.2308829 0.1931416
##   life           0.2394754 0.2394754 0.4049565 0.4155350 0.4580560 0.3979175
##   list(author    0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   list(content   0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   list(sec       0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   manifest       0.2922017 0.2922017 0.2721808 0.4398752 0.4007022 0.2599952
##   may            0.3011136 0.3011136 0.2549571 0.5569706 0.3333063 0.2454736
##   mday           0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   meta           0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   min            0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   mon            0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   must           0.2629653 0.2629653 0.2850261 0.2334725 0.3140685 0.2257792
##   nation         0.1032149 0.1032149 0.2886530 0.3281531 0.3365034 0.1811383
##   new            0.3427149 0.3427149 0.5531409 0.2996098 0.5092031 0.3658506
##   now            0.2106103 0.2106103 0.3405962 0.3367251 0.4486164 0.2702416
##   one            0.3228135 0.3228135 0.2362842 0.3545879 0.3916919 0.1916906
##   origin         0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   people         0.1857661 0.1857661 0.2411706 0.3209703 0.3662203 0.1766395
##   place          0.2582269 0.2582269 0.3491567 0.4476874 0.2768434 0.2873019
##   power          0.4383675 0.4383675 0.2950354 0.6203970 0.4339732 0.3753447
##   rights         0.2556135 0.2556135 0.1842631 0.5352416 0.4383570 0.1840578
##   secure         0.3231296 0.3231296 0.2984811 0.4466394 0.4501667 0.2436991
##   shall          0.2411517 0.2411517 0.3460512 0.5735407 0.4058363 0.3109780
##   states         0.3788196 0.3788196 0.3327275 0.6482478 0.4840780 0.3434238
##   time           0.1802393 0.1802393 0.2352651 0.2445026 0.2889587 0.2204211
##   united         0.2771348 0.2771348 0.3251816 0.5217468 0.4608861 0.3284096
##   wday           0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   will           0.1510494 0.1510494 0.2372428 0.2107531 0.2874576 0.1866537
##   without        0.2247511 0.2247511 0.2963509 0.5419869 0.3459189 0.3174816
##   world          0.3116223 0.3116223 0.4817611 0.3071941 0.3923308 0.3106657
##   yday           0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   year           0.0000000 0.0000000 0.2736932 0.3051469 0.2325625 0.2105213
##   make           0.2870054 0.2870054 0.2783959 0.3338100 0.3519057 0.2050114
##   peace          0.2777678 0.2777678 0.4000809 0.5405906 0.5198688 0.2803222
##                 Terms
## Terms            character(0))) character(0),  citizens countries datetimestamp
##   0),                 0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   123,                0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   action              0.2736932     0.2736932 0.2871853 0.2346954     0.2736932
##   american            0.3051469     0.3051469 0.3479934 0.4966198     0.3051469
##   called              0.2325625     0.2325625 0.3281584 0.3884188     0.2325625
##   can                 0.2105213     0.2105213 0.2930308 0.2823231     0.2105213
##   character(0)))      0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   character(0),       0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   citizens            0.3084235     0.3084235 0.0000000 0.2304740     0.3084235
##   countries           0.2694235     0.2694235 0.2304740 0.0000000     0.2694235
##   datetimestamp       0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   description         0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   everincreasing      0.2186974     0.2186974 0.2364246 0.1887603     0.2186974
##   faith               0.2556057     0.2556057 0.3875941 0.3562552     0.2556057
##   free                0.3642011     0.3642011 0.3498437 0.3377865     0.3642011
##   good                0.1812272     0.1812272 0.3791704 0.3623326     0.1812272
##   government          0.2439350     0.2439350 0.1988928 0.1765039     0.2439350
##   great               0.2615817     0.2615817 0.2514811 0.2318000     0.2615817
##   heading             0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   hope                0.2465767     0.2465767 0.4133685 0.3686981     0.2465767
##   hour                0.1069114     0.1069114 0.3827955 0.3593138     0.1069114
##   isdst               0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   justice             0.1696191     0.1696191 0.3041795 0.2421622     0.1696191
##   language            0.1062622     0.1062622 0.1799705 0.1966653     0.1062622
##   life                0.2394754     0.2394754 0.4774712 0.5104055     0.2394754
##   list(author         0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   list(content        0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   list(sec            0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   manifest            0.2922017     0.2922017 0.2062496 0.2174617     0.2922017
##   may                 0.3011136     0.3011136 0.2173129 0.2059181     0.3011136
##   mday                0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   meta                0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   min                 0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   mon                 0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   must                0.2629653     0.2629653 0.3549976 0.3439585     0.2629653
##   nation              0.1032149     0.1032149 0.3259120 0.2735945     0.1032149
##   new                 0.3427149     0.3427149 0.5123780 0.6150752     0.3427149
##   now                 0.2106103     0.2106103 0.4043963 0.3925734     0.2106103
##   one                 0.3228135     0.3228135 0.2008465 0.2445306     0.3228135
##   origin              0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   people              0.1857661     0.1857661 0.1873170 0.2255799     0.1857661
##   place               0.2582269     0.2582269 0.2398434 0.3944120     0.2582269
##   power               0.4383675     0.4383675 0.1888553 0.2758797     0.4383675
##   rights              0.2556135     0.2556135 0.2633969 0.1842227     0.2556135
##   secure              0.3231296     0.3231296 0.3062820 0.2187839     0.3231296
##   shall               0.2411517     0.2411517 0.3980996 0.3605197     0.2411517
##   states              0.3788196     0.3788196 0.2497892 0.2273409     0.3788196
##   time                0.1802393     0.1802393 0.2371404 0.3154990     0.1802393
##   united              0.2771348     0.2771348 0.2829606 0.2102188     0.2771348
##   wday                0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   will                0.1510494     0.1510494 0.2787183 0.2481983     0.1510494
##   without             0.2247511     0.2247511 0.3153140 0.2456291     0.2247511
##   world               0.3116223     0.3116223 0.5273353 0.5510524     0.3116223
##   yday                0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   year                0.0000000     0.0000000 0.3084235 0.2694235     0.0000000
##   make                0.2870054     0.2870054 0.3095185 0.3355558     0.2870054
##   peace               0.2777678     0.2777678 0.4883528 0.3614036     0.2777678
##                 Terms
## Terms            description everincreasing     faith      free      good
##   0),              0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   123,             0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   action           0.2736932      0.2971282 0.3568184 0.3844362 0.3545748
##   american         0.3051469      0.3990526 0.4530689 0.5361073 0.4199530
##   called           0.2325625      0.3711941 0.4431053 0.4881184 0.2901072
##   can              0.2105213      0.3168069 0.3254937 0.3957589 0.2494620
##   character(0)))   0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   character(0),    0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   citizens         0.3084235      0.2364246 0.3875941 0.3498437 0.3791704
##   countries        0.2694235      0.1887603 0.3562552 0.3377865 0.3623326
##   datetimestamp    0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   description      0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   everincreasing   0.2186974      0.0000000 0.3467121 0.4095602 0.3489480
##   faith            0.2556057      0.3467121 0.0000000 0.2628773 0.3992667
##   free             0.3642011      0.4095602 0.2628773 0.0000000 0.3976671
##   good             0.1812272      0.3489480 0.3992667 0.3976671 0.0000000
##   government       0.2439350      0.2038159 0.3688935 0.3795765 0.3201950
##   great            0.2615817      0.1943506 0.3875343 0.4089899 0.3073050
##   heading          0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   hope             0.2465767      0.3480163 0.3070473 0.3520430 0.3214639
##   hour             0.1069114      0.2754873 0.3182851 0.4680480 0.2338179
##   isdst            0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   justice          0.1696191      0.2201156 0.3163116 0.4156400 0.3017595
##   language         0.1062622      0.2358100 0.3353395 0.3657240 0.2321640
##   life             0.2394754      0.4293273 0.3590834 0.3650165 0.3510038
##   list(author      0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   list(content     0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   list(sec         0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   manifest         0.2922017      0.2869856 0.4041606 0.4414995 0.3308990
##   may              0.3011136      0.2485394 0.4636186 0.3691530 0.3178149
##   mday             0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   meta             0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   min              0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   mon              0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   must             0.2629653      0.3402277 0.3610163 0.3917937 0.4144707
##   nation           0.1032149      0.2557636 0.2450432 0.3756398 0.2298995
##   new              0.3427149      0.4450541 0.5064848 0.5644522 0.4346620
##   now              0.2106103      0.3251475 0.4073692 0.4604766 0.3565571
##   one              0.3228135      0.2819428 0.4538551 0.3985835 0.3679843
##   origin           0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   people           0.1857661      0.2313704 0.2616443 0.2635600 0.2603983
##   place            0.2582269      0.3083978 0.4320127 0.4804085 0.2777006
##   power            0.4383675      0.3350926 0.5898255 0.4132375 0.4245210
##   rights           0.2556135      0.2515484 0.3683577 0.3572963 0.3319872
##   secure           0.3231296      0.3290915 0.3212865 0.3377984 0.3973605
##   shall            0.2411517      0.3572345 0.3406215 0.3541365 0.3215939
##   states           0.3788196      0.2709801 0.4823813 0.3797843 0.4097998
##   time             0.1802393      0.2231006 0.3318286 0.3601695 0.2969035
##   united           0.2771348      0.2298496 0.3770863 0.3042528 0.3491342
##   wday             0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   will             0.1510494      0.2012852 0.3264156 0.3822256 0.2150069
##   without          0.2247511      0.2063253 0.3829404 0.3926966 0.2865643
##   world            0.3116223      0.5269638 0.3859550 0.4178739 0.4397523
##   yday             0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   year             0.0000000      0.2186974 0.2556057 0.3642011 0.1812272
##   make             0.2870054      0.3631464 0.3768318 0.3950963 0.2873995
##   peace            0.2777678      0.4660162 0.2734899 0.3783843 0.4085537
##                 Terms
## Terms            government     great   heading      hope      hour     isdst
##   0),             0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   123,            0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   action          0.2525751 0.2603146 0.2736932 0.3286828 0.3092995 0.2736932
##   american        0.4381982 0.4910762 0.3051469 0.3762422 0.2709890 0.3051469
##   called          0.3674708 0.3971755 0.2325625 0.2914391 0.2861276 0.2325625
##   can             0.2201304 0.2730551 0.2105213 0.2287876 0.2498471 0.2105213
##   character(0)))  0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   character(0),   0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   citizens        0.1988928 0.2514811 0.3084235 0.4133685 0.3827955 0.3084235
##   countries       0.1765039 0.2318000 0.2694235 0.3686981 0.3593138 0.2694235
##   datetimestamp   0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   description     0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   everincreasing  0.2038159 0.1943506 0.2186974 0.3480163 0.2754873 0.2186974
##   faith           0.3688935 0.3875343 0.2556057 0.3070473 0.3182851 0.2556057
##   free            0.3795765 0.4089899 0.3642011 0.3520430 0.4680480 0.3642011
##   good            0.3201950 0.3073050 0.1812272 0.3214639 0.2338179 0.1812272
##   government      0.0000000 0.2114394 0.2439350 0.3647836 0.3417528 0.2439350
##   great           0.2114394 0.0000000 0.2615817 0.3992243 0.3660620 0.2615817
##   heading         0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   hope            0.3647836 0.3992243 0.2465767 0.0000000 0.2755973 0.2465767
##   hour            0.3417528 0.3660620 0.1069114 0.2755973 0.0000000 0.1069114
##   isdst           0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   justice         0.2539866 0.2390129 0.1696191 0.2564148 0.2392147 0.1696191
##   language        0.1717172 0.2223406 0.1062622 0.2864869 0.1987643 0.1062622
##   life            0.4745677 0.4471745 0.2394754 0.3399784 0.3001830 0.2394754
##   list(author     0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   list(content    0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   list(sec        0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   manifest        0.2048855 0.2185407 0.2922017 0.4000634 0.3350421 0.2922017
##   may             0.1706117 0.2015659 0.3011136 0.3669758 0.3952348 0.3011136
##   mday            0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   meta            0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   min             0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   mon             0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   must            0.3108007 0.3724111 0.2629653 0.2327013 0.2620640 0.2629653
##   nation          0.2547614 0.2442298 0.1032149 0.2330913 0.1618522 0.1032149
##   new             0.4996151 0.4741301 0.3427149 0.3988559 0.3662012 0.3427149
##   now             0.3111200 0.3194376 0.2106103 0.4028245 0.2850911 0.2106103
##   one             0.1779422 0.2915943 0.3228135 0.3669010 0.3317532 0.3228135
##   origin          0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   people          0.1543438 0.2046470 0.1857661 0.3002582 0.2711191 0.1857661
##   place           0.2853730 0.2255405 0.2582269 0.3845113 0.3256913 0.2582269
##   power           0.2204875 0.2710177 0.4383675 0.5515264 0.5094098 0.4383675
##   rights          0.1557096 0.2274302 0.2556135 0.3197530 0.3076159 0.2556135
##   secure          0.2483849 0.2733582 0.3231296 0.2487225 0.3728258 0.3231296
##   shall           0.2641309 0.3077091 0.2411517 0.4048258 0.3378214 0.2411517
##   states          0.1479527 0.2112843 0.3788196 0.5428521 0.4834690 0.3788196
##   time            0.2475803 0.2340116 0.1802393 0.2362141 0.2417052 0.1802393
##   united          0.2031798 0.1850202 0.2771348 0.4911462 0.3679492 0.2771348
##   wday            0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   will            0.2223562 0.2124028 0.1510494 0.2525104 0.1760610 0.1510494
##   without         0.2612675 0.2062573 0.2247511 0.3480220 0.3308025 0.2247511
##   world           0.5139663 0.4984612 0.3116223 0.2823261 0.3212076 0.3116223
##   yday            0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   year            0.2439350 0.2615817 0.0000000 0.2465767 0.1069114 0.0000000
##   make            0.2739569 0.2929428 0.2870054 0.2333408 0.3387995 0.2870054
##   peace           0.3557125 0.3358256 0.2777678 0.3391682 0.3522379 0.2777678
##                 Terms
## Terms              justice  language      life list(author list(content
##   0),            0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   123,           0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   action         0.3186538 0.2260204 0.4049565   0.2736932    0.2736932
##   american       0.3691250 0.3824950 0.4155350   0.3051469    0.3051469
##   called         0.3346682 0.2308829 0.4580560   0.2325625    0.2325625
##   can            0.2202974 0.1931416 0.3979175   0.2105213    0.2105213
##   character(0))) 0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   character(0),  0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   citizens       0.3041795 0.1799705 0.4774712   0.3084235    0.3084235
##   countries      0.2421622 0.1966653 0.5104055   0.2694235    0.2694235
##   datetimestamp  0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   description    0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   everincreasing 0.2201156 0.2358100 0.4293273   0.2186974    0.2186974
##   faith          0.3163116 0.3353395 0.3590834   0.2556057    0.2556057
##   free           0.4156400 0.3657240 0.3650165   0.3642011    0.3642011
##   good           0.3017595 0.2321640 0.3510038   0.1812272    0.1812272
##   government     0.2539866 0.1717172 0.4745677   0.2439350    0.2439350
##   great          0.2390129 0.2223406 0.4471745   0.2615817    0.2615817
##   heading        0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   hope           0.2564148 0.2864869 0.3399784   0.2465767    0.2465767
##   hour           0.2392147 0.1987643 0.3001830   0.1069114    0.1069114
##   isdst          0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   justice        0.0000000 0.2298235 0.3127039   0.1696191    0.1696191
##   language       0.2298235 0.0000000 0.3856791   0.1062622    0.1062622
##   life           0.3127039 0.3856791 0.0000000   0.2394754    0.2394754
##   list(author    0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   list(content   0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   list(sec       0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   manifest       0.2093755 0.2405289 0.4315715   0.2922017    0.2922017
##   may            0.2842671 0.1819527 0.5787241   0.3011136    0.3011136
##   mday           0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   meta           0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   min            0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   mon            0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   must           0.2590383 0.3198470 0.3624059   0.2629653    0.2629653
##   nation         0.1286761 0.1744743 0.2479912   0.1032149    0.1032149
##   new            0.4009755 0.4473816 0.3771771   0.3427149    0.3427149
##   now            0.3937696 0.2648853 0.4869662   0.2106103    0.2106103
##   one            0.3535845 0.2285222 0.5403341   0.3228135    0.3228135
##   origin         0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   people         0.2151497 0.1722590 0.3498061   0.1857661    0.1857661
##   place          0.2891575 0.2053266 0.4903024   0.2582269    0.2582269
##   power          0.4437583 0.2180484 0.6447565   0.4383675    0.4383675
##   rights         0.2528017 0.1890911 0.4938392   0.2556135    0.2556135
##   secure         0.2700347 0.2746265 0.4486339   0.3231296    0.3231296
##   shall          0.3342515 0.2584013 0.4896929   0.2411517    0.2411517
##   states         0.3229409 0.2746730 0.6545667   0.3788196    0.3788196
##   time           0.2571278 0.1855317 0.3884673   0.1802393    0.1802393
##   united         0.3086215 0.2648455 0.4928024   0.2771348    0.2771348
##   wday           0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   will           0.1947787 0.2052257 0.3868693   0.1510494    0.1510494
##   without        0.2237040 0.2100620 0.4439553   0.2247511    0.2247511
##   world          0.3588431 0.4259145 0.3173307   0.3116223    0.3116223
##   yday           0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   year           0.1696191 0.1062622 0.2394754   0.0000000    0.0000000
##   make           0.3361181 0.2855817 0.3691802   0.2870054    0.2870054
##   peace          0.2410718 0.3641464 0.3628738   0.2777678    0.2777678
##                 Terms
## Terms             list(sec  manifest       may      mday      meta       min
##   0),            0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   123,           0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   action         0.2736932 0.2721808 0.2549571 0.2736932 0.2736932 0.2736932
##   american       0.3051469 0.4398752 0.5569706 0.3051469 0.3051469 0.3051469
##   called         0.2325625 0.4007022 0.3333063 0.2325625 0.2325625 0.2325625
##   can            0.2105213 0.2599952 0.2454736 0.2105213 0.2105213 0.2105213
##   character(0))) 0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   character(0),  0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   citizens       0.3084235 0.2062496 0.2173129 0.3084235 0.3084235 0.3084235
##   countries      0.2694235 0.2174617 0.2059181 0.2694235 0.2694235 0.2694235
##   datetimestamp  0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   description    0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   everincreasing 0.2186974 0.2869856 0.2485394 0.2186974 0.2186974 0.2186974
##   faith          0.2556057 0.4041606 0.4636186 0.2556057 0.2556057 0.2556057
##   free           0.3642011 0.4414995 0.3691530 0.3642011 0.3642011 0.3642011
##   good           0.1812272 0.3308990 0.3178149 0.1812272 0.1812272 0.1812272
##   government     0.2439350 0.2048855 0.1706117 0.2439350 0.2439350 0.2439350
##   great          0.2615817 0.2185407 0.2015659 0.2615817 0.2615817 0.2615817
##   heading        0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   hope           0.2465767 0.4000634 0.3669758 0.2465767 0.2465767 0.2465767
##   hour           0.1069114 0.3350421 0.3952348 0.1069114 0.1069114 0.1069114
##   isdst          0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   justice        0.1696191 0.2093755 0.2842671 0.1696191 0.1696191 0.1696191
##   language       0.1062622 0.2405289 0.1819527 0.1062622 0.1062622 0.1062622
##   life           0.2394754 0.4315715 0.5787241 0.2394754 0.2394754 0.2394754
##   list(author    0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   list(content   0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   list(sec       0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   manifest       0.2922017 0.0000000 0.2628802 0.2922017 0.2922017 0.2922017
##   may            0.3011136 0.2628802 0.0000000 0.3011136 0.3011136 0.3011136
##   mday           0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   meta           0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   min            0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   mon            0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   must           0.2629653 0.3584185 0.3689675 0.2629653 0.2629653 0.2629653
##   nation         0.1032149 0.2473162 0.3275089 0.1032149 0.1032149 0.1032149
##   new            0.3427149 0.5421030 0.6337560 0.3427149 0.3427149 0.3427149
##   now            0.2106103 0.4029997 0.3803697 0.2106103 0.2106103 0.2106103
##   one            0.3228135 0.2794814 0.2342582 0.3228135 0.3228135 0.3228135
##   origin         0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   people         0.1857661 0.2224025 0.2058312 0.1857661 0.1857661 0.1857661
##   place          0.2582269 0.3073992 0.2987367 0.2582269 0.2582269 0.2582269
##   power          0.4383675 0.2834444 0.1677866 0.4383675 0.4383675 0.4383675
##   rights         0.2556135 0.2166184 0.1960175 0.2556135 0.2556135 0.2556135
##   secure         0.3231296 0.2832253 0.2573536 0.3231296 0.3231296 0.3231296
##   shall          0.2411517 0.3279675 0.2220753 0.2411517 0.2411517 0.2411517
##   states         0.3788196 0.2117812 0.1307284 0.3788196 0.3788196 0.3788196
##   time           0.1802393 0.3394911 0.3090275 0.1802393 0.1802393 0.1802393
##   united         0.2771348 0.2877599 0.2488076 0.2771348 0.2771348 0.2771348
##   wday           0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   will           0.1510494 0.2177655 0.2951557 0.1510494 0.1510494 0.1510494
##   without        0.2247511 0.2533404 0.1493131 0.2247511 0.2247511 0.2247511
##   world          0.3116223 0.5324166 0.5989574 0.3116223 0.3116223 0.3116223
##   yday           0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   year           0.0000000 0.2922017 0.3011136 0.0000000 0.0000000 0.0000000
##   make           0.2870054 0.3431624 0.3398646 0.2870054 0.2870054 0.2870054
##   peace          0.2777678 0.3558002 0.4595850 0.2777678 0.2777678 0.2777678
##                 Terms
## Terms                  mon      must    nation       new       now       one
##   0),            0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   123,           0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   action         0.2736932 0.2850261 0.2886530 0.5531409 0.3405962 0.2362842
##   american       0.3051469 0.2334725 0.3281531 0.2996098 0.3367251 0.3545879
##   called         0.2325625 0.3140685 0.3365034 0.5092031 0.4486164 0.3916919
##   can            0.2105213 0.2257792 0.1811383 0.3658506 0.2702416 0.1916906
##   character(0))) 0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   character(0),  0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   citizens       0.3084235 0.3549976 0.3259120 0.5123780 0.4043963 0.2008465
##   countries      0.2694235 0.3439585 0.2735945 0.6150752 0.3925734 0.2445306
##   datetimestamp  0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   description    0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   everincreasing 0.2186974 0.3402277 0.2557636 0.4450541 0.3251475 0.2819428
##   faith          0.2556057 0.3610163 0.2450432 0.5064848 0.4073692 0.4538551
##   free           0.3642011 0.3917937 0.3756398 0.5644522 0.4604766 0.3985835
##   good           0.1812272 0.4144707 0.2298995 0.4346620 0.3565571 0.3679843
##   government     0.2439350 0.3108007 0.2547614 0.4996151 0.3111200 0.1779422
##   great          0.2615817 0.3724111 0.2442298 0.4741301 0.3194376 0.2915943
##   heading        0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   hope           0.2465767 0.2327013 0.2330913 0.3988559 0.4028245 0.3669010
##   hour           0.1069114 0.2620640 0.1618522 0.3662012 0.2850911 0.3317532
##   isdst          0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   justice        0.1696191 0.2590383 0.1286761 0.4009755 0.3937696 0.3535845
##   language       0.1062622 0.3198470 0.1744743 0.4473816 0.2648853 0.2285222
##   life           0.2394754 0.3624059 0.2479912 0.3771771 0.4869662 0.5403341
##   list(author    0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   list(content   0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   list(sec       0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   manifest       0.2922017 0.3584185 0.2473162 0.5421030 0.4029997 0.2794814
##   may            0.3011136 0.3689675 0.3275089 0.6337560 0.3803697 0.2342582
##   mday           0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   meta           0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   min            0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   mon            0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   must           0.2629653 0.0000000 0.2554897 0.3379381 0.3446933 0.3299351
##   nation         0.1032149 0.2554897 0.0000000 0.3030352 0.2825099 0.3584450
##   new            0.3427149 0.3379381 0.3030352 0.0000000 0.4120310 0.4856952
##   now            0.2106103 0.3446933 0.2825099 0.4120310 0.0000000 0.3066310
##   one            0.3228135 0.3299351 0.3584450 0.4856952 0.3066310 0.0000000
##   origin         0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   people         0.1857661 0.2584850 0.1669020 0.4197367 0.2390412 0.2590810
##   place          0.2582269 0.4075232 0.2759926 0.4570153 0.3932803 0.3450915
##   power          0.4383675 0.5589987 0.4773372 0.7006705 0.5083115 0.2243196
##   rights         0.2556135 0.3539563 0.2601587 0.5542660 0.3219160 0.1842666
##   secure         0.3231296 0.2724153 0.2574130 0.5509508 0.3697238 0.3198418
##   shall          0.2411517 0.4286209 0.3317205 0.5808853 0.3135053 0.4206556
##   states         0.3788196 0.4977840 0.3798923 0.6808303 0.3716212 0.2715666
##   time           0.1802393 0.2445813 0.1919791 0.2580735 0.2116518 0.2391018
##   united         0.2771348 0.4266492 0.2709743 0.6102369 0.2985044 0.3201302
##   wday           0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   will           0.1510494 0.2096607 0.1637432 0.3055922 0.1806064 0.2305403
##   without        0.2247511 0.4055130 0.2810528 0.6078607 0.2953097 0.3580206
##   world          0.3116223 0.2280928 0.2489950 0.2142673 0.4448365 0.4723478
##   yday           0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   year           0.0000000 0.2629653 0.1032149 0.3427149 0.2106103 0.3228135
##   make           0.2870054 0.2531402 0.2952891 0.3380555 0.3363157 0.2807009
##   peace          0.2777678 0.3619373 0.2081131 0.4036871 0.4303305 0.4520582
##                 Terms
## Terms               origin    people     place     power    rights    secure
##   0),            0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   123,           0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   action         0.2736932 0.2411706 0.3491567 0.2950354 0.1842631 0.2984811
##   american       0.3051469 0.3209703 0.4476874 0.6203970 0.5352416 0.4466394
##   called         0.2325625 0.3662203 0.2768434 0.4339732 0.4383570 0.4501667
##   can            0.2105213 0.1766395 0.2873019 0.3753447 0.1840578 0.2436991
##   character(0))) 0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   character(0),  0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   citizens       0.3084235 0.1873170 0.2398434 0.1888553 0.2633969 0.3062820
##   countries      0.2694235 0.2255799 0.3944120 0.2758797 0.1842227 0.2187839
##   datetimestamp  0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   description    0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   everincreasing 0.2186974 0.2313704 0.3083978 0.3350926 0.2515484 0.3290915
##   faith          0.2556057 0.2616443 0.4320127 0.5898255 0.3683577 0.3212865
##   free           0.3642011 0.2635600 0.4804085 0.4132375 0.3572963 0.3377984
##   good           0.1812272 0.2603983 0.2777006 0.4245210 0.3319872 0.3973605
##   government     0.2439350 0.1543438 0.2853730 0.2204875 0.1557096 0.2483849
##   great          0.2615817 0.2046470 0.2255405 0.2710177 0.2274302 0.2733582
##   heading        0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   hope           0.2465767 0.3002582 0.3845113 0.5515264 0.3197530 0.2487225
##   hour           0.1069114 0.2711191 0.3256913 0.5094098 0.3076159 0.3728258
##   isdst          0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   justice        0.1696191 0.2151497 0.2891575 0.4437583 0.2528017 0.2700347
##   language       0.1062622 0.1722590 0.2053266 0.2180484 0.1890911 0.2746265
##   life           0.2394754 0.3498061 0.4903024 0.6447565 0.4938392 0.4486339
##   list(author    0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   list(content   0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   list(sec       0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   manifest       0.2922017 0.2224025 0.3073992 0.2834444 0.2166184 0.2832253
##   may            0.3011136 0.2058312 0.2987367 0.1677866 0.1960175 0.2573536
##   mday           0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   meta           0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   min            0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   mon            0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   must           0.2629653 0.2584850 0.4075232 0.5589987 0.3539563 0.2724153
##   nation         0.1032149 0.1669020 0.2759926 0.4773372 0.2601587 0.2574130
##   new            0.3427149 0.4197367 0.4570153 0.7006705 0.5542660 0.5509508
##   now            0.2106103 0.2390412 0.3932803 0.5083115 0.3219160 0.3697238
##   one            0.3228135 0.2590810 0.3450915 0.2243196 0.1842666 0.3198418
##   origin         0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   people         0.1857661 0.0000000 0.2829539 0.2970615 0.2272488 0.2491841
##   place          0.2582269 0.2829539 0.0000000 0.3156333 0.3022974 0.4010950
##   power          0.4383675 0.2970615 0.3156333 0.0000000 0.2331547 0.4233339
##   rights         0.2556135 0.2272488 0.3022974 0.2331547 0.0000000 0.2228500
##   secure         0.3231296 0.2491841 0.4010950 0.4233339 0.2228500 0.0000000
##   shall          0.2411517 0.2420883 0.4091058 0.4111691 0.2444715 0.3287758
##   states         0.3788196 0.2479222 0.3288980 0.1773186 0.1672523 0.3129164
##   time           0.1802393 0.2131327 0.2347682 0.3826398 0.2836233 0.3023304
##   united         0.2771348 0.2500864 0.3126846 0.2896242 0.2452104 0.2763979
##   wday           0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   will           0.1510494 0.1898189 0.2864015 0.4248998 0.2249573 0.2435735
##   without        0.2247511 0.2229675 0.3225104 0.3050253 0.2761227 0.2710091
##   world          0.3116223 0.3549730 0.4343980 0.7214903 0.5393091 0.4748047
##   yday           0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   year           0.0000000 0.1857661 0.2582269 0.4383675 0.2556135 0.3231296
##   make           0.2870054 0.2670512 0.3700452 0.4641451 0.3209995 0.2132659
##   peace          0.2777678 0.3071600 0.4377180 0.5842944 0.3412417 0.3314323
##                 Terms
## Terms                shall    states      time    united      wday      will
##   0),            0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   123,           0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   action         0.3460512 0.3327275 0.2352651 0.3251816 0.2736932 0.2372428
##   american       0.5735407 0.6482478 0.2445026 0.5217468 0.3051469 0.2107531
##   called         0.4058363 0.4840780 0.2889587 0.4608861 0.2325625 0.2874576
##   can            0.3109780 0.3434238 0.2204211 0.3284096 0.2105213 0.1866537
##   character(0))) 0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   character(0),  0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   citizens       0.3980996 0.2497892 0.2371404 0.2829606 0.3084235 0.2787183
##   countries      0.3605197 0.2273409 0.3154990 0.2102188 0.2694235 0.2481983
##   datetimestamp  0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   description    0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   everincreasing 0.3572345 0.2709801 0.2231006 0.2298496 0.2186974 0.2012852
##   faith          0.3406215 0.4823813 0.3318286 0.3770863 0.2556057 0.3264156
##   free           0.3541365 0.3797843 0.3601695 0.3042528 0.3642011 0.3822256
##   good           0.3215939 0.4097998 0.2969035 0.3491342 0.1812272 0.2150069
##   government     0.2641309 0.1479527 0.2475803 0.2031798 0.2439350 0.2223562
##   great          0.3077091 0.2112843 0.2340116 0.1850202 0.2615817 0.2124028
##   heading        0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   hope           0.4048258 0.5428521 0.2362141 0.4911462 0.2465767 0.2525104
##   hour           0.3378214 0.4834690 0.2417052 0.3679492 0.1069114 0.1760610
##   isdst          0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   justice        0.3342515 0.3229409 0.2571278 0.3086215 0.1696191 0.1947787
##   language       0.2584013 0.2746730 0.1855317 0.2648455 0.1062622 0.2052257
##   life           0.4896929 0.6545667 0.3884673 0.4928024 0.2394754 0.3868693
##   list(author    0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   list(content   0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   list(sec       0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   manifest       0.3279675 0.2117812 0.3394911 0.2877599 0.2922017 0.2177655
##   may            0.2220753 0.1307284 0.3090275 0.2488076 0.3011136 0.2951557
##   mday           0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   meta           0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   min            0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   mon            0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   must           0.4286209 0.4977840 0.2445813 0.4266492 0.2629653 0.2096607
##   nation         0.3317205 0.3798923 0.1919791 0.2709743 0.1032149 0.1637432
##   new            0.5808853 0.6808303 0.2580735 0.6102369 0.3427149 0.3055922
##   now            0.3135053 0.3716212 0.2116518 0.2985044 0.2106103 0.1806064
##   one            0.4206556 0.2715666 0.2391018 0.3201302 0.3228135 0.2305403
##   origin         0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   people         0.2420883 0.2479222 0.2131327 0.2500864 0.1857661 0.1898189
##   place          0.4091058 0.3288980 0.2347682 0.3126846 0.2582269 0.2864015
##   power          0.4111691 0.1773186 0.3826398 0.2896242 0.4383675 0.4248998
##   rights         0.2444715 0.1672523 0.2836233 0.2452104 0.2556135 0.2249573
##   secure         0.3287758 0.3129164 0.3023304 0.2763979 0.3231296 0.2435735
##   shall          0.0000000 0.2378221 0.3764621 0.3266657 0.2411517 0.2853986
##   states         0.2378221 0.0000000 0.3583509 0.1663374 0.3788196 0.3194047
##   time           0.3764621 0.3583509 0.0000000 0.3135349 0.1802393 0.1662107
##   united         0.3266657 0.1663374 0.3135349 0.0000000 0.2771348 0.2757858
##   wday           0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   will           0.2853986 0.3194047 0.1662107 0.2757858 0.1510494 0.0000000
##   without        0.2536850 0.2140895 0.3415666 0.2547703 0.2247511 0.2448891
##   world          0.5614315 0.7254198 0.3115638 0.5568450 0.3116223 0.3525221
##   yday           0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   year           0.2411517 0.3788196 0.1802393 0.2771348 0.0000000 0.1510494
##   make           0.3495582 0.4247068 0.2380485 0.3809815 0.2870054 0.1688238
##   peace          0.3791968 0.4655269 0.3369301 0.3665685 0.2777678 0.3447415
##                 Terms
## Terms              without     world      yday      year      make     peace
##   0),            0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   123,           0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   action         0.2963509 0.4817611 0.2736932 0.2736932 0.2783959 0.4000809
##   american       0.5419869 0.3071941 0.3051469 0.3051469 0.3338100 0.5405906
##   called         0.3459189 0.3923308 0.2325625 0.2325625 0.3519057 0.5198688
##   can            0.3174816 0.3106657 0.2105213 0.2105213 0.2050114 0.2803222
##   character(0))) 0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   character(0),  0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   citizens       0.3153140 0.5273353 0.3084235 0.3084235 0.3095185 0.4883528
##   countries      0.2456291 0.5510524 0.2694235 0.2694235 0.3355558 0.3614036
##   datetimestamp  0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   description    0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   everincreasing 0.2063253 0.5269638 0.2186974 0.2186974 0.3631464 0.4660162
##   faith          0.3829404 0.3859550 0.2556057 0.2556057 0.3768318 0.2734899
##   free           0.3926966 0.4178739 0.3642011 0.3642011 0.3950963 0.3783843
##   good           0.2865643 0.4397523 0.1812272 0.1812272 0.2873995 0.4085537
##   government     0.2612675 0.5139663 0.2439350 0.2439350 0.2739569 0.3557125
##   great          0.2062573 0.4984612 0.2615817 0.2615817 0.2929428 0.3358256
##   heading        0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   hope           0.3480220 0.2823261 0.2465767 0.2465767 0.2333408 0.3391682
##   hour           0.3308025 0.3212076 0.1069114 0.1069114 0.3387995 0.3522379
##   isdst          0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   justice        0.2237040 0.3588431 0.1696191 0.1696191 0.3361181 0.2410718
##   language       0.2100620 0.4259145 0.1062622 0.1062622 0.2855817 0.3641464
##   life           0.4439553 0.3173307 0.2394754 0.2394754 0.3691802 0.3628738
##   list(author    0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   list(content   0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   list(sec       0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   manifest       0.2533404 0.5324166 0.2922017 0.2922017 0.3431624 0.3558002
##   may            0.1493131 0.5989574 0.3011136 0.3011136 0.3398646 0.4595850
##   mday           0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   meta           0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   min            0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   mon            0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   must           0.4055130 0.2280928 0.2629653 0.2629653 0.2531402 0.3619373
##   nation         0.2810528 0.2489950 0.1032149 0.1032149 0.2952891 0.2081131
##   new            0.6078607 0.2142673 0.3427149 0.3427149 0.3380555 0.4036871
##   now            0.2953097 0.4448365 0.2106103 0.2106103 0.3363157 0.4303305
##   one            0.3580206 0.4723478 0.3228135 0.3228135 0.2807009 0.4520582
##   origin         0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   people         0.2229675 0.3549730 0.1857661 0.1857661 0.2670512 0.3071600
##   place          0.3225104 0.4343980 0.2582269 0.2582269 0.3700452 0.4377180
##   power          0.3050253 0.7214903 0.4383675 0.4383675 0.4641451 0.5842944
##   rights         0.2761227 0.5393091 0.2556135 0.2556135 0.3209995 0.3412417
##   secure         0.2710091 0.4748047 0.3231296 0.3231296 0.2132659 0.3314323
##   shall          0.2536850 0.5614315 0.2411517 0.2411517 0.3495582 0.3791968
##   states         0.2140895 0.7254198 0.3788196 0.3788196 0.4247068 0.4655269
##   time           0.3415666 0.3115638 0.1802393 0.1802393 0.2380485 0.3369301
##   united         0.2547703 0.5568450 0.2771348 0.2771348 0.3809815 0.3665685
##   wday           0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   will           0.2448891 0.3525221 0.1510494 0.1510494 0.1688238 0.3447415
##   without        0.0000000 0.5924759 0.2247511 0.2247511 0.3499942 0.4513866
##   world          0.5924759 0.0000000 0.3116223 0.3116223 0.3554393 0.2297893
##   yday           0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   year           0.2247511 0.3116223 0.0000000 0.0000000 0.2870054 0.2777678
##   make           0.3499942 0.3554393 0.2870054 0.2870054 0.0000000 0.3423202
##   peace          0.4513866 0.2297893 0.2777678 0.2777678 0.3423202 0.0000000

6 Conclusion

First, consider the most frequent terms. As shown above, the most frequent word was “will”. This matches what we would expect of political speeches: politicians make promises, and promises are usually phrased in the future tense. We also notice that words such as “govern” (the stem of “government”), “state” and “nation” are used frequently. These are typical of a head of state's vocabulary, so we expected them to rank highly as well.

Second, we measured document similarity with two methods: Jaccard similarity and ratio of matches. Both produced low scores, but at least for the top-ranked pairs, the ratio-of-matches scores were twice as high as the Jaccard scores.
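
One possible reason the two measures differ in scale lies in their denominators: Jaccard similarity divides the shared tokens by the union of both documents, while ratio of matches divides only by the size of the second document. The sketch below illustrates this with two toy token vectors. The functions `jaccard()` and `match_ratio()` and the vectors `doc_a` and `doc_b` are hand-written assumptions for illustration. They follow the usual definitions of the two measures and are not the “textreuse” implementation or the speeches analysed above.

```r
# Illustrative, hand-written versions of the two measures (base R only).
# These are assumptions for demonstration, not the textreuse functions.

# Jaccard similarity: shared unique tokens divided by all unique tokens.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Ratio of matches: share of tokens in b that also occur in a.
# Directional: swapping a and b can change the result.
match_ratio <- function(a, b) {
  sum(b %in% a) / length(b)
}

# Toy documents, already tokenized (hypothetical data).
doc_a <- c("we", "the", "people", "will", "secure", "peace")
doc_b <- c("the", "people", "will", "make", "peace")

jaccard(doc_a, doc_b)      # 4 shared / 7 in the union = 0.5714286
match_ratio(doc_a, doc_b)  # 4 of the 5 tokens in doc_b = 0.8
```

Because the union is never smaller than either document, the ratio of matches computed on sets of unique tokens cannot fall below the Jaccard score for the same pair, which is consistent with the higher scores reported for the second method.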